A Deep Dive into the Google Cluster Workload Traces: Analyzing the Application Failure Characteristics and User Behaviors
Large-scale cloud data centers have gained popularity due to their high
availability, rapid elasticity, scalability, and low cost. However, current
data centers continue to have high failure rates due to the lack of proper
resource utilization and early failure detection. To maximize resource
efficiency and reduce failure rates in large-scale cloud data centers, it is
crucial to understand the workload and failure characteristics. In this paper,
we perform a deep analysis of the 2019 Google Cluster Trace Dataset, which
contains 2.4TiB of workload traces from eight different clusters around the
world. We explore the characteristics of failed and killed jobs in Google's
production cloud and attempt to correlate them with key attributes such as
resource usage, job priority, scheduling class, job duration, and the number of
task resubmissions. Our analysis reveals several important characteristics of
failed jobs that contribute to job failure and hence, could be used for
developing an early failure prediction system. Also, we present a novel usage
analysis to identify heterogeneity in jobs and tasks submitted by users. We are
able to identify specific users who control more than half of all collection
events on a single cluster. We contend that these characteristics could be
useful in developing an early job failure prediction system that could be
utilized for dynamic rescheduling of the job scheduler and thus improving
resource utilization in large-scale cloud data centers while reducing failure
rates.
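The per-attribute failure analysis described above can be sketched as follows. This is a minimal illustration, not the paper's actual method; the trace rows and the priority/final-state fields are hypothetical stand-ins for fields in the Google 2019 trace schema.

```python
from collections import defaultdict

# Hypothetical (priority, final_state) pairs standing in for trace records.
rows = [
    (0, "FAIL"), (0, "FAIL"), (0, "FINISH"),
    (100, "FINISH"), (100, "KILL"),
    (200, "FINISH"),
]

totals = defaultdict(int)
unsuccessful = defaultdict(int)
for priority, state in rows:
    totals[priority] += 1
    # Treat both failed and killed jobs as unsuccessful terminations.
    if state in ("FAIL", "KILL"):
        unsuccessful[priority] += 1

# Fraction of unsuccessful jobs per priority tier.
failure_rate = {p: unsuccessful[p] / totals[p] for p in totals}
print(failure_rate)
```

The same grouping generalizes to scheduling class, job duration buckets, or resubmission counts by swapping the grouping key.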
Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters
The increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena that substantially affect overall system performance. One such phenomenon is known as the “Long Tail”, whereby a small proportion of task stragglers significantly impedes job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to gaining in-depth knowledge of straggler occurrence and focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact of stragglers, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5% of task stragglers impact 50% of total jobs for batch processes, and 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution-pattern modeling and online analytic agents that monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11% into their execution lifecycle with 95% accuracy for short-duration jobs.
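A common way to tag the task stragglers discussed above is to flag tasks whose duration exceeds some multiple of the job's median task duration. The sketch below uses a 1.5× median threshold as an illustrative assumption; the paper's exact straggler criterion may differ.

```python
from statistics import median

def find_stragglers(task_durations, threshold=1.5):
    """Return indices of tasks whose duration exceeds threshold x the
    job's median task duration (an assumed multiple-of-median rule)."""
    med = median(task_durations)
    return [i for i, d in enumerate(task_durations) if d > threshold * med]

# One job with 8 tasks; task 5 runs far longer than its peers (the long tail).
durations = [10, 11, 9, 10, 12, 48, 10, 11]
print(find_stragglers(durations))  # -> [5]
```

Such an offline pass over historical traces gives the per-job straggler fractions from which figures like “5% of stragglers impact 50% of jobs” can be computed.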
High-Throughput Computing on High-Performance Platforms: A Case Study
The computing systems used by LHC experiments have historically consisted of
the federation of hundreds to thousands of distributed resources, ranging from
small to mid-size resources. In spite of the impressive scale of the existing
distributed computing solutions, the federation of small to mid-size resources
will be insufficient to meet projected future demands. This paper is a case
study of how the ATLAS experiment has embraced Titan, a DOE leadership
facility, in conjunction with traditional distributed high-throughput computing
to reach sustained production scales of approximately 52M core-hours a year.
The three main contributions of this paper are: (i) a critical evaluation of
design and operational considerations to support the sustained, scalable and
production usage of Titan; (ii) a preliminary characterization of a next
generation executor for PanDA to support new workloads and advanced execution
modes; and (iii) early lessons for how current and future experimental and
observational systems can be integrated with production supercomputers and
other platforms in a general and extensible manner.
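A quick back-of-the-envelope check puts the sustained scale quoted above in perspective: 52M core-hours per year corresponds to roughly how many continuously busy cores?

```python
# 52M core-hours/year divided by the hours in a year gives the equivalent
# number of cores kept busy around the clock.
core_hours_per_year = 52_000_000
hours_per_year = 365 * 24  # 8760
sustained_cores = core_hours_per_year / hours_per_year
print(round(sustained_cores))  # -> 5936
```

That is, the reported throughput is equivalent to keeping roughly six thousand cores fully occupied year-round.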