632 research outputs found

    Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

    The increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena that substantially affect overall system performance. One such phenomenon is known as the “Long Tail”, whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical for gaining in-depth knowledge of straggler occurrence and for focusing development and research efforts on solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact of stragglers, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5% of task stragglers impact 50% of total jobs for batch processes, and that 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection that combines offline modeling of execution patterns with online analytic agents that monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11% into their execution lifecycle with 95% accuracy for short duration jobs.

    Power Management in Heterogeneous MapReduce Cluster

    The growing expense of power in data centers relative to operation costs has been a concern for the past several decades. It has been predicted that, without intervention, the energy cost will soon outgrow the infrastructure and operation costs. It is therefore of great importance to make data center clusters more energy efficient, which is also critical for avoiding system overheating and failures. In addition, energy inefficiency causes not only the loss of capital but also environmental pollution. Various Power Management (PM) strategies have been developed over the years to make systems more energy efficient and to counteract the sharply rising cost of electricity. However, it is still a challenge to make a system both power efficient and computation efficient due to many underlying system constraints. In this thesis, we investigate a Power Management technique for heterogeneous MapReduce clusters that also maintains the required system QoS (Quality of Service). For a cluster that supports MapReduce jobs, it is necessary to develop a PM technique that also considers data availability. We develop our PM strategy by exploiting the fact that the servers in the system are underutilized most of the time. Hence, we first develop a model of our testbed and study how the server utilization levels affect power consumption and system throughput. With the established models, we formulate and solve the power optimization problem for heterogeneous MapReduce clusters, in which we control the server utilization levels intelligently to minimize the total power consumption. We have conducted simulations and shown the power savings achieved using our PM technique. We then validate some of our simulation results by running experiments on a real testbed. Our simulation and experimental data show that our PM strategy works well for heterogeneous MapReduce clusters that consist of both power-efficient and power-inefficient servers. Adviser: Ying L