8 research outputs found

    Earlier stage for straggler detection and handling using combined CPU test and LATE methodology

    Using MapReduce in Hadoop helps lower the execution time and power consumption of large-scale data processing. However, job processing can be delayed when tasks are assigned to bad or congested machines. Such "straggler tasks" increase execution time and power consumption, which raises costs and degrades the performance of computing systems. This research proposes a hybrid MapReduce framework referred to as the combinatory late-machine (CLM) framework. Implementing this framework facilitates early and timely detection and identification of stragglers, thereby enabling prompt, appropriate and effective action

    An Approach for Modeling and Ranking Node-level Stragglers in Cloud Datacenters

    The ability of servers to effectively execute tasks within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention situations, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) result in assigned tasks becoming task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes directly correlate to task straggler manifestation. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. By using a production Cloud system as a case study, we demonstrate how node execution performance is driven by temporal changes in node operation as opposed to node hardware capacity. Different sample sets have been filtered in order to evaluate the generality of our framework, and the analytic results demonstrate that node abilities of executing parallel tasks tend to follow a 3-parameter-loglogistic distribution. Further statistical attribute values such as confidence interval, quantile value, extreme case possibility, etc. can also be used for ranking and identifying potential straggler nodes within the cluster. We exploit a graph-based algorithm for partitioning server nodes into five levels, with 0.83% of node-level stragglers identified. Our work lays the foundation towards enhancing scheduling algorithms by avoiding slow nodes, reducing task straggler occurrence, and improving parallel job performance

    Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

    Increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena substantially affecting overall system performance. One such phenomenon is known as “Long Tail”, whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate approximately 5% of task stragglers impact 50% of total jobs for batch processes, and 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution patterns modeling and online analytic agents to monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11% into their execution lifecycle with 95% accuracy for short duration jobs

    Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres

    Cloud computing systems are splitting compute- and data-intensive jobs into smaller tasks to execute them in a parallel manner using clusters to improve execution time. However, such systems at increasing scale are exposed to stragglers, whereby abnormally slow running tasks executing within a job substantially affect job performance completion. Such stragglers are a direct threat towards attaining fast execution of data-intensive jobs within cloud computing. Researchers have proposed an assortment of different mechanisms, frameworks, and management techniques to detect and mitigate stragglers both proactively and reactively. In this paper, we present a comprehensive review of straggler management techniques within large-scale cloud data centres. We provide a detailed taxonomy of straggler causes, as well as proposed management and mitigation techniques based on straggler characteristics and properties. From this systematic review, we outline several outstanding challenges and potential directions of possible future work for straggler research

    Straggler Detection in Parallel Computing Systems through Dynamic Threshold Calculation

    Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks significantly impedes parallel job completion. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large-scale distributed systems to tolerate stragglers. This approach defines stragglers by specifying a static threshold value, which is compared against the temporal difference between an individual task and the average task progression for a job. However, specifying a static threshold debilitates speculation effectiveness as it fails to consider the intrinsic diversity of job timing constraints within modern day Cloud computing systems. Capturing such heterogeneity makes it possible to impose different levels of strictness for replica creation while achieving specified levels of QoS for different application types. Furthermore, a static threshold also fails to consider system environmental constraints in terms of replication overheads and optimal system resource usage. In this paper we present an algorithm for dynamically calculating a threshold value to identify task stragglers, considering key parameters including job QoS timing constraints, task execution characteristics, and optimal system resource utilization. We study and demonstrate the effectiveness of our algorithm through simulating a number of different operational scenarios based on real production cluster data against state-of-the-art solutions. Results demonstrate that our approach is capable of creating 58.62% fewer replicas under high resource utilization while reducing response time up to 17.86% for idle periods compared to a static threshold
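
The static-threshold rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `find_stragglers`, the progress scores in the range 0 to 1, and the threshold value 0.2 are all assumptions chosen for illustration. A task is flagged when its progress lags the job's average progress by more than the fixed threshold.

```python
from statistics import mean


def find_stragglers(progress, threshold=0.2):
    """Return IDs of tasks whose progress lags the job average by more than `threshold`.

    `progress` maps a task ID to a progress score in [0, 1].
    The threshold is static: it is the same for every job (hypothetical value).
    """
    avg = mean(progress.values())
    return sorted(task for task, score in progress.items() if avg - score > threshold)


# Hypothetical progress scores for one job with four parallel tasks.
progress = {"t1": 0.9, "t2": 0.85, "t3": 0.3, "t4": 0.95}
print(find_stragglers(progress))
```

Because the threshold is fixed, the same lag triggers replication for a latency-critical job and for a relaxed batch job alike, which is the limitation the abstract attributes to static thresholds.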

    Improving Academic Natural Language Processing Infrastructures Utilizing Cluster Computation

    In light of widespread digitization endeavors and ever-growing textual data generation, developing efficient academic Natural Language Processing (NLP) infrastructures, which can deal with large amounts of data, is of particular importance. Novel computation technologies allow tools that support big data and heavy computation while performing timely and cost-effective data processing. This development has led researchers to demand that knowledge be extracted from ever-increasing textual data before it is outdated. Cluster computation is a modern technology for handling big data efficiently. It provides distribution of computing and data over a number of machines in a cluster, as well as efficient use of resources, which are key requirements to process big data in a timely manner. It also assures applications’ high availability and fault tolerance, which are fundamental concerns when dealing with vast amounts of data. In addition, it provides load balancing of data during the execution of tasks, which results in optimal use of resources and enhances efficiency. Data-oriented parallelization is an effective solution to enable the currently available academic NLP infrastructures to process big data. This approach offers a solution to parallelize the NLP tools which comprise identical non-complicated tasks without the expense of changing NLP algorithms. This thesis presents the adaption of cluster computation technology to academic NLP infrastructures to address the notable features that are essential to process vast quantities of text materials efficiently, in terms of both resources and time. Apache Spark on top of Apache Hadoop and its ecosystem have been utilized to develop a set of NLP tools that provide a distributed environment to execute the NLP tasks. Many experiments were conducted to assess the functionality of the designated strategy. 
This thesis shows that using cluster computation technology and data-oriented parallelization enables academic NLP infrastructures to execute large amounts of textual data in a timely manner while improving the performance of the NLP tools. Moreover, these experiments provide information that brings a more realistic and transparent estimation of workflows’ costs (required hardware resources) and execution time, along with the fastest, optimum, or feasible resource configuration for each individual workflow. This knowledge can be employed by users to trade-off between run-time, size of data, and hardware, and it enables them to design a strategy for data storage, duration of data retention, and delivery time. This has the potential to enhance researchers’ satisfaction when using academic NLP infrastructures. The thesis also shows that a cluster computation approach provides the capacity to adapt NLP services with JIT delivery systems. The proposed strategy assures the reliability and predictability of the services, which are the main characteristics of the services in JIT delivery systems. Defining the relevant parameters, recording the behavior of the services, and analyzing the generated data resulted in the provision of knowledge that can be utilized to create a service catalog—a fundamental requirement for the services in JIT delivery systems—for each service offered. This knowledge also helps to generate the performance profiles for each item mentioned in the service catalog and to update them continuously to cover new experiments and improve service quality

    A prescriptive analytics approach for energy efficiency in datacentres.

    Given the evolution of Cloud Computing in recent years, users and clients adopting Cloud Computing for both personal and business needs have increased at an unprecedented scale. This has naturally led to the increased deployments and implementations of Cloud datacentres across the globe. As a consequence of this increasing adoption of Cloud Computing, Cloud datacentres are witnessed to be massive energy consumers and environmental polluters. Whilst the energy implications of Cloud datacentres are being addressed from various research perspectives, predicting the future trend and behaviours of workloads at the datacentres thereby reducing the active server resources is one particular dimension of green computing gaining the interests of researchers and Cloud providers. However, this includes various practical and analytical challenges imposed by the increased dynamism of Cloud systems. The behavioural characteristics of Cloud workloads and users are still not perfectly clear which restrains the reliability of the prediction accuracy of existing research works in this context. To this end, this thesis presents a comprehensive descriptive analytics of Cloud workload and user behaviours, uncovering the cause and energy related implications of Cloud Computing. Furthermore, the characteristics of Cloud workloads and users including latency levels, job heterogeneity, user dynamicity, straggling task behaviours, energy implications of stragglers, job execution and termination patterns and the inherent periodicity among Cloud workload and user behaviours have been empirically presented. Driven by descriptive analytics, a novel user behaviour forecasting framework has been developed, aimed at a tri-fold forecast of user behaviours including the session duration of users, anticipated number of submissions and the arrival trend of the incoming workloads. 
Furthermore, a novel resource optimisation framework has been proposed to provide the optimal level of resources for executing jobs with reduced server energy expenditures and job terminations. This optimisation framework encompasses a resource estimation module to predict the anticipated resource consumption level for the arrived jobs and a classification module to classify tasks based on their resource intensiveness. Both proposed frameworks have been verified theoretically and tested experimentally based on Google Cloud trace logs. Experimental analysis demonstrates the effectiveness of the proposed framework in terms of the achieved reliability of the forecast results and in reducing the server energy expenditures spent towards executing jobs at the datacentres.