6 research outputs found

    Editorial for FGCS Special issue on “Time-critical Applications on Software-defined Infrastructures”

    Performance requirements in many applications can often be modelled as time-related constraints, for example the duration of data processing for disaster early warning [1], latency in live event broadcasting [2], and jitter during audio/video conferences [3]. These time constraints are typically treated either in an “as fast as possible” manner, as with latency-sensitive high-performance computing or communication tasks, or in a “timeliness” manner, where tasks must finish within a given window, as in real-time systems, following the classification in [4]. To meet the required time constraints, one has to carefully analyse them, engineer and integrate the system components, and optimise the scheduling of computing and communication tasks; developing a time-critical application is thus time-consuming and costly. Over the past decades, the infrastructure technologies of computing, storage and networking have made tremendous progress. Beyond the capacity and performance of physical devices, virtualisation technologies offer effective resource management and isolation at different levels, such as Java Virtual Machines at the application level, Docker containers at the operating-system level, and Virtual Machines at the whole-system level. Moreover, network embedding [5] and software-defined networking [6] provide network-level virtualisation and control, enabling a new infrastructure paradigm in which resources can be virtualised, isolated, and dynamically customised to application needs. Software-defined infrastructures, including Cloud, Fog, Edge, software-defined networking and network function virtualisation, are emerging as environments for distributed applications with time-critical requirements, but they also pose challenges in effectively exploiting advanced infrastructure features during system engineering and dynamic control. This special issue on “Time-critical Applications on Software-defined Infrastructures” focuses on practical aspects of the design, development, customisation and performance-oriented operation of such applications for Clouds and other distributed environments.
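
    To make the editorial's two constraint classes concrete, the following minimal sketch (illustrative only, not taken from the special issue) expresses an “as fast as possible” latency budget and a “timeliness” window as checks against a measured completion time; the class names and thresholds are assumptions.

```python
from dataclasses import dataclass

@dataclass
class LatencyBudget:
    """'As fast as possible': flag any completion slower than a soft budget."""
    budget_s: float  # illustrative soft target, e.g. a broadcast latency bound

    def satisfied(self, elapsed_s: float) -> bool:
        return elapsed_s <= self.budget_s

@dataclass
class TimelinessWindow:
    """'Timeliness': the task must finish within [earliest, deadline]."""
    earliest_s: float
    deadline_s: float

    def satisfied(self, elapsed_s: float) -> bool:
        return self.earliest_s <= elapsed_s <= self.deadline_s

# Illustrative use: a 2 s latency budget vs. a 1-5 s real-time window.
print(LatencyBudget(2.0).satisfied(1.4))          # True
print(TimelinessWindow(1.0, 5.0).satisfied(0.6))  # False: too early also misses
```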

    Straggler mitigation in Hadoop MapReduce framework: a review

    Processing huge and complex data to obtain useful information is challenging, even though several big data processing frameworks have been proposed and further enhanced. One of the most prominent of these frameworks is MapReduce, which relies on distributed and parallel processing. However, MapReduce suffers serious performance degradation due to the slow execution of certain tasks, called stragglers. Failing to handle stragglers causes delays and inflates the overall job execution time. Several straggler mitigation techniques have therefore been proposed to improve MapReduce performance. This study provides a comprehensive and qualitative review of the existing straggler mitigation solutions, and a taxonomy of the available solutions is presented. Critical research issues and future research directions are identified and discussed to guide researchers and scholars.
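
    As a toy illustration of why stragglers dominate (an assumption-laden sketch, not drawn from the review): a parallel MapReduce stage finishes only when its slowest task does, so a single slow task sets the job time.

```python
# Toy model: a parallel stage completes when its slowest task completes.
task_durations_s = [10.2, 9.8, 10.5, 10.1, 42.0]  # the 42 s task is a straggler

job_time = max(task_durations_s)  # the stage is gated by the slowest task
median = sorted(task_durations_s)[len(task_durations_s) // 2]

print(f"job time: {job_time:.1f}s, median task: {median:.1f}s, "
      f"slowdown from the straggler: {job_time / median:.1f}x")
# job time: 42.0s, median task: 10.2s, slowdown from the straggler: 4.1x
```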

    Intelligent Straggler Mitigation in Massive-Scale Computing Systems

    In order to satisfy increasing demand for Cloud services, modern computing systems are often massive in scale, typically consisting of hundreds to thousands of heterogeneous machine nodes. Parallel computing frameworks such as MapReduce are widely deployed over such cluster infrastructure to provide reliable yet prompt services to customers. However, complex characteristics of Cloud workloads, including multi-dimensional resource requirements and highly changeable system environments, e.g. dynamic node performance, introduce new challenges for service providers in terms of both customer experience and system efficiency. One primary challenge is the straggler problem, whereby a small subset of the parallelized tasks take abnormally long to execute compared with their siblings, leading to extended job response times and potential late-timing failures. The state-of-the-art approach to straggler mitigation is speculative execution. Although it has been deployed in several real-world systems with a variety of implementation optimizations, the analysis in this thesis shows that speculative execution is often inefficient: according to various production tracelogs of data centers, its failure rate can be as high as 71%. Straggler mitigation is a complicated problem in its own right: 1) stragglers may lead to different consequences for parallel job execution, possibly with different degrees of severity; 2) whether a task should be regarded as a straggler is highly subjective, depending upon application and system conditions; 3) the efficiency of speculative execution would improve if dynamic node performance could be modelled and predicted appropriately; and 4) there are other types of stragglers, e.g. those caused by data skew, that are beyond the capability of speculative execution. This thesis starts with a quantitative and rigorous analysis of stragglers, including their root causes and impacts, the execution environments running them, and the limitations of their mitigation. Scientific principles of straggler mitigation are investigated and new algorithms are developed. An intelligent system for straggler mitigation is then designed and developed, compatible with the majority of current parallel computing frameworks. Combining historical data analysis with online adaptation, the system mitigates stragglers intelligently: it dynamically judges whether a task is a straggler and handles it, avoids currently weak nodes, and deals with data skew, a special type of straggler, using a dedicated method. Comprehensive analysis and evaluation show that the system reduces job response time by up to 55% compared with the speculator used in the default YARN system, while the optimal improvement a speculation-based method may achieve is around 66% in theory. The system also achieves a much higher speculation success rate than other production systems, up to 89%.
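
    The thesis's intelligent mitigation system itself is not reproduced here, but the speculative-execution baseline it improves upon can be sketched roughly: estimate each running task's remaining time from its observed progress rate and replicate the ones expected to finish far later than their siblings. This is a generic heuristic in the spirit of progress-rate speculation; the field names and the 1.5x threshold are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RunningTask:
    task_id: str
    progress: float   # fraction completed, in [0, 1]
    elapsed_s: float  # wall-clock time since the task started

def estimated_remaining(t: RunningTask) -> float:
    """Assume a constant progress rate and extrapolate time to completion."""
    rate = max(t.progress, 1e-6) / t.elapsed_s
    return (1.0 - t.progress) / rate

def speculation_candidates(tasks, slowdown_threshold=1.5):
    """Replicate tasks whose estimated remaining time is well above the median."""
    remaining = {t.task_id: estimated_remaining(t) for t in tasks}
    median = sorted(remaining.values())[len(remaining) // 2]
    return [tid for tid, r in remaining.items() if r > slowdown_threshold * median]

tasks = [
    RunningTask("t1", progress=0.9, elapsed_s=30),  # ~3.3 s remaining
    RunningTask("t2", progress=0.8, elapsed_s=32),  # ~8 s remaining
    RunningTask("t3", progress=0.2, elapsed_s=35),  # straggler: ~140 s remaining
]
print(speculation_candidates(tasks))  # ['t3']
```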

    Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres

    Cloud computing systems split compute- and data-intensive jobs into smaller tasks that are executed in parallel across clusters to improve execution time. However, at increasing scale such systems are exposed to stragglers, whereby abnormally slow running tasks within a job substantially degrade job completion performance. Stragglers are thus a direct threat to the fast execution of data-intensive jobs within cloud computing. Researchers have proposed an assortment of mechanisms, frameworks, and management techniques to detect and mitigate stragglers both proactively and reactively. In this paper, we present a comprehensive review of straggler management techniques within large-scale cloud data centres. We provide a detailed taxonomy of straggler causes, as well as of the proposed management and mitigation techniques, based on straggler characteristics and properties. From this systematic review, we outline several outstanding challenges and potential directions for future work on straggler research.
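
    A common reactive detection rule in this literature flags a task once its runtime exceeds a multiple of the median runtime of its completed siblings. The sketch below is illustrative only; the 1.5x threshold is an assumption, not a value taken from the survey.

```python
def is_straggler(task_runtime_s, completed_sibling_runtimes_s, threshold=1.5):
    """Reactive rule: compare a running task against its completed siblings."""
    if not completed_sibling_runtimes_s:
        return False  # nothing to compare against yet
    ordered = sorted(completed_sibling_runtimes_s)
    median = ordered[len(ordered) // 2]
    return task_runtime_s > threshold * median

siblings = [11.0, 10.4, 9.9, 10.8]
print(is_straggler(12.0, siblings))  # False: within the normal spread
print(is_straggler(25.0, siblings))  # True: flag for mitigation
```

    Proactive techniques, by contrast, act before a task slows down, for example by steering new work away from nodes with poor recent history.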

    Mitigating stragglers to avoid QoS violation for time-critical applications through dynamic server blacklisting

    The straggler problem is one of the most challenging obstacles to rapid and predictable response times for applications in cluster infrastructures, leading to potential QoS violations and late-timing failures. Straggler tasks occur for reasons such as resource contention and hardware heterogeneity, and become more severe with increased system scale and complexity. Speculative execution and blacklisting are the two major straggler-tolerance techniques, but each has its own limitations. The former creates a replica task to catch up with an identified straggler, but normally applies no node selection when deciding where to launch the backup; ignoring server performance hurts the speculation success rate. The latter typically relies on manual configuration, despite the fact that the ability of nodes to effectively execute tasks changes over time; in addition, misidentifying weak-performance nodes decreases system capacity. Combining these two techniques, we present DSB, a dynamic server blacklisting framework that takes into account both the historical and the current behavior of a server node to increase straggler mitigation effectiveness. Servers are ranked at each time interval according to their performance in fulfilling jobs rather than their physical capacities, and the worst-performing ones are temporarily blacklisted. As a result, no new tasks or replicas are assigned to those straggler-prone nodes within the following time window. DSB also provides an alternative API through which an adjustable number k of the worst-ranked nodes can be blacklisted; the optimal k is investigated as a trade-off between capacity loss and straggler mitigation efficiency. Results show that the DSB scheme is capable of increasing the successful speculation rate to up to 89%. In addition, it can improve job completion time by up to 55.43% compared with the default speculator in the YARN platform. This helps to reduce the chance of QoS violation, which is particularly important for time-critical applications.
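
    The DSB implementation itself is not reproduced here, but its stated mechanism (rank servers each interval by recent performance in fulfilling jobs rather than physical capacity, then blacklist the k worst for the next window) might be sketched as follows. The scoring function (mean recent task time) and the data shapes are assumptions for illustration.

```python
from statistics import mean

def rank_and_blacklist(recent_task_times_by_node, k=1):
    """Rank nodes by mean task completion time over the last interval
    (an assumed performance proxy) and blacklist the k worst performers."""
    scores = {node: mean(times)
              for node, times in recent_task_times_by_node.items() if times}
    worst_first = sorted(scores, key=scores.get, reverse=True)
    return set(worst_first[:k])

# One scheduling interval: per-node task completion times in seconds.
interval = {
    "node-a": [10.1, 9.8, 10.4],
    "node-b": [10.3, 10.0, 10.2],
    "node-c": [31.0, 28.5, 30.2],  # consistently slow in this interval
}
print(rank_and_blacklist(interval, k=1))  # {'node-c'}: excluded next window
```

    Choosing k trades blacklisted capacity against the chance that a task or replica lands on another weak node, which is the trade-off the paper investigates.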