219 research outputs found

    Performance optimization of big data computing workflows for batch and stream data processing in multi-clouds

    Workflow techniques have been widely used as a major computing solution in many science domains. With the rapid deployment of cloud infrastructures around the globe and the economic benefits of cloud-based computing and storage services, an increasing number of scientific workflows have migrated or are in active transition to clouds. As the scale of scientific applications continues to grow, it is now common to deploy various data- and network-intensive computing workflows, such as serial computing workflows, MapReduce/Spark-based workflows, and Storm-based stream data processing workflows, in multi-cloud environments, where inter-cloud data transfer often plays a significant role in both workflow performance and financial cost. Rigorous mathematical models are constructed to analyze the intra- and inter-cloud execution process of scientific workflows, and a class of budget-constrained workflow mapping problems is formulated to optimize the network performance of big data workflows in multi-cloud environments. Research shows that these problems are all NP-complete, and a heuristic solution is designed for each that takes into consideration module execution, data transfer, and I/O operations. The performance superiority of the proposed solutions over existing methods is illustrated through extensive simulations and further verified by real-life workflow experiments deployed in public clouds.
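
To make the budget-constrained mapping objective above more concrete, the following is a minimal sketch of how one fixed module-to-VM mapping might be evaluated. It is not the authors' model or heuristic: the function name `evaluate_mapping`, its parameters, and the simple cost model (execution time multiplied by a per-VM price, plus a flat charge per transfer) are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def evaluate_mapping(exec_time, transfer_time, preds, vm_cost_rate, transfer_cost):
    """Return (makespan, total_cost) for one fixed mapping (hypothetical model).

    exec_time[m]          -- execution time of module m on its assigned VM
    transfer_time[(u, m)] -- data transfer time on the edge u -> m
    preds[m]              -- set of predecessor modules of m
    vm_cost_rate[m]       -- assumed price per time unit of the VM running m
    transfer_cost[(u, m)] -- assumed flat charge for moving data over u -> m
    """
    finish = {}
    # Visit modules so that every predecessor is finished before its successors.
    for m in TopologicalSorter(preds).static_order():
        # A module starts once the slowest incoming (finish + transfer) completes.
        start = max((finish[u] + transfer_time[(u, m)] for u in preds[m]), default=0.0)
        finish[m] = start + exec_time[m]
    makespan = max(finish.values())
    cost = sum(exec_time[m] * vm_cost_rate[m] for m in exec_time)
    cost += sum(transfer_cost.values())
    return makespan, cost


# Toy diamond-shaped workflow: a -> {b, c} -> d (all values are made up).
preds = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
exec_time = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 2.0}
transfer_time = {("a", "b"): 1.0, ("a", "c"): 0.5, ("b", "d"): 0.0, ("c", "d"): 2.0}
vm_cost_rate = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
transfer_cost = {("a", "b"): 0.5, ("a", "c"): 0.0, ("b", "d"): 0.0, ("c", "d"): 1.0}

makespan, cost = evaluate_mapping(exec_time, transfer_time, preds, vm_cost_rate, transfer_cost)
budget = 10.0  # hypothetical budget
feasible = cost <= budget
```

A search procedure would call such an evaluator on many candidate mappings and keep the fastest one whose cost stays within the budget; zero transfer time and cost on an edge can stand in for two modules placed in the same cloud.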

    Performance optimization and energy efficiency of big-data computing workflows

    Next-generation e-science is producing colossal amounts of data, now frequently termed Big Data, on the order of terabytes at present and petabytes or even exabytes in the foreseeable future. These scientific applications typically feature data-intensive workflows composed of moldable parallel computing jobs, such as MapReduce, with intricate inter-job dependencies. The granularity of task partitioning in each moldable job of such big data workflows has a significant impact on workflow completion time, energy consumption, and financial cost if executed in clouds, which remains largely unexplored. This dissertation conducts an in-depth investigation into the properties of moldable jobs and provides an experiment-based validation of the performance model where the total workload of a moldable job increases along with the degree of parallelism. Furthermore, this dissertation conducts rigorous research on workflow execution dynamics in resource sharing environments and explores the interactions between workflow mapping and task scheduling on various computing platforms. A workflow optimization architecture is developed to seamlessly integrate three interrelated technical components, i.e., resource allocation, job mapping, and task scheduling. Cloud computing provides a cost-effective computing platform for big data workflows where moldable parallel computing models are widely applied to meet stringent performance requirements. Based on the moldable parallel computing performance model, a big-data workflow mapping model is constructed and a workflow mapping problem is formulated to minimize workflow makespan under a budget constraint in public clouds.
This dissertation shows this problem to be strongly NP-complete and designs i) a fully polynomial-time approximation scheme for a special case with a pipeline-structured workflow executed on virtual machines of a single class, and ii) a heuristic for a generalized problem with an arbitrary directed acyclic graph-structured workflow executed on virtual machines of multiple classes. The performance superiority of the proposed solution is illustrated by extensive simulation-based results in Hadoop/YARN in comparison with existing workflow mapping models and algorithms. Considering that large-scale workflows for big data analytics have become a main consumer of energy in data centers, this dissertation also delves into the problem of static workflow mapping to minimize the dynamic energy consumption of a workflow request under a deadline constraint in Hadoop clusters, which is shown to be strongly NP-hard. A fully polynomial-time approximation scheme is designed for a special case with a pipeline-structured workflow on a homogeneous cluster and a heuristic is designed for the generalized problem with an arbitrary directed acyclic graph-structured workflow on a heterogeneous cluster. This problem is further extended to a dynamic version with deadline-constrained MapReduce workflows to minimize dynamic energy consumption in Hadoop clusters. This dissertation proposes a semi-dynamic online scheduling algorithm based on adaptive task partitioning to reduce dynamic energy consumption while meeting performance requirements from a global perspective, and also develops corresponding system modules for algorithm implementation in the Hadoop ecosystem. 
The performance superiority of the proposed solutions in terms of dynamic energy saving and deadline missing rate is illustrated by extensive simulation results in comparison with existing algorithms, and further validated through real-life workflow implementation and experiments using the Oozie workflow engine in Hadoop/YARN systems.

    A methodological framework for cloud resource provisioning and scheduling of data parallel applications under uncertainty

    Data parallel applications are being extensively deployed in cloud environments because of the possibility of dynamically provisioning storage and computation resources. To identify cost-effective solutions that satisfy the desired service levels, resource provisioning and scheduling play a critical role. Nevertheless, the unpredictable behavior of cloud performance makes the estimation of the resources actually needed quite complex. In this paper we propose a provisioning and scheduling framework that explicitly tackles uncertainties and performance variability of the cloud infrastructure and of the workload. This framework allows cloud users to estimate in advance, i.e., prior to the actual execution of the applications, the resource settings that cope with uncertainty. We formulate an optimization problem where the characteristics not perfectly known or affected by uncertain phenomena are represented as random variables modeled by the corresponding probability distributions. Provisioning and scheduling decisions, while optimizing various metrics such as monetary leasing costs of cloud resources and application execution time, take full account of the uncertainties encountered in cloud environments. To test our framework, we consider data parallel applications characterized by a deadline constraint and we investigate the impact of their characteristics and of the variability of the cloud infrastructure. The experiments show that the resource provisioning and scheduling plans identified by our approach cope well with uncertainties and ensure that the application deadline is satisfied.
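
One way to picture treating uncertain quantities as random variables is a small Monte Carlo estimate of how often a deadline is met for a given number of VMs. The sketch below is an illustration under stated assumptions, not the framework proposed in the paper: the normal task-time distribution truncated at zero, the greedy least-loaded placement, and the names `deadline_hit_rate` and `smallest_feasible_vm_count` are all hypothetical.

```python
import random


def deadline_hit_rate(n_vms, n_tasks, mean_task_time, cv, deadline, trials=2000, seed=0):
    """Estimate the probability that all tasks finish by the deadline.

    Task times are drawn from a normal distribution (an assumed model) with
    the given mean and coefficient of variation cv, truncated at zero.
    """
    rng = random.Random(seed)
    sigma = cv * mean_task_time
    hits = 0
    for _ in range(trials):
        loads = [0.0] * n_vms
        for _ in range(n_tasks):
            t = max(0.0, rng.gauss(mean_task_time, sigma))
            # Greedy list scheduling: give the task to the least loaded VM.
            i = loads.index(min(loads))
            loads[i] += t
        if max(loads) <= deadline:
            hits += 1
    return hits / trials


def smallest_feasible_vm_count(max_vms, target, **workload):
    """Return the fewest VMs whose estimated hit rate reaches the target, or None."""
    for n in range(1, max_vms + 1):
        if deadline_hit_rate(n, **workload) >= target:
            return n
    return None


# Hypothetical workload: 40 tasks of about 10 time units each, 30% variability.
n = smallest_feasible_vm_count(
    max_vms=20, target=0.95,
    n_tasks=40, mean_task_time=10.0, cv=0.3, deadline=60.0, trials=500,
)
```

Because leasing cost typically grows with the number of VMs, the smallest count that reaches the target probability is a natural stand-in for the cheapest plan in this toy setting.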

    Serverless Computing and Scheduling Tasks on Cloud: A Review

    Recently, Function-as-a-Service (FaaS) has gained increasing attention from researchers. FaaS, also known as serverless computing, is a new concept in cloud computing in which code execution is triggered in response to certain events. In this paper, we discuss various proposals related to scheduling tasks in clouds. These proposals are categorized according to their objective functions, namely minimizing execution time, minimizing execution cost, or multiple objectives (time and cost). The dependency relationships between the tasks play a vital role in determining the efficiency of the scheduling approach. This dependency may result in resource underutilization. FaaS is expected to have a significant impact on the process of scheduling tasks. The underutilization problem can be reduced by adopting a hybrid approach that combines the benefits of both FaaS and Infrastructure-as-a-Service (IaaS). Using FaaS, we can run the small tasks remotely and focus only on scheduling the large tasks. This helps increase the utilization of the resources because the small tasks are not considered during the scheduling process. If cloud vendors extend the restricted time limit, the complete workflow could run on the serverless architecture, avoiding the scheduling problem.
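
The hybrid idea described above, sending small tasks to FaaS and scheduling only the large ones on IaaS, can be sketched as a simple partition by estimated runtime. This is an illustrative assumption rather than a method taken from the review: the `Task` type, the `split_tasks` function, and the 900-second limit are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    est_runtime_s: float  # estimated runtime in seconds (assumed to be known)


def split_tasks(tasks, faas_time_limit_s):
    """Partition tasks into (faas, iaas) by a hypothetical FaaS time limit.

    Tasks that fit within the limit are offloaded to FaaS; the rest remain
    for the IaaS scheduler, which now has fewer tasks to place.
    """
    faas = [t for t in tasks if t.est_runtime_s <= faas_time_limit_s]
    iaas = [t for t in tasks if t.est_runtime_s > faas_time_limit_s]
    return faas, iaas


tasks = [
    Task("parse", 30.0),
    Task("train", 7200.0),
    Task("resize", 5.0),
    Task("aggregate", 1800.0),
]
faas_tasks, iaas_tasks = split_tasks(tasks, faas_time_limit_s=900.0)
```

Raising `faas_time_limit_s` moves more tasks into the FaaS group, which mirrors the abstract's remark that a longer vendor time limit could let the whole workflow run serverless.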