
    ReStore: Reusing Results of MapReduce Jobs

    Analyzing large-scale data has emerged as an important activity for many organizations in the past few years. This large-scale data analysis is facilitated by the MapReduce programming and execution model and its implementations, most notably Hadoop. Users of MapReduce often have analysis tasks that are too complex to express as individual MapReduce jobs. Instead, they use high-level query languages such as Pig, Hive, or Jaql to express their complex tasks. The compilers of these languages translate queries into workflows of MapReduce jobs. Each job in these workflows reads its input from the distributed file system used by the MapReduce system and produces output that is stored in this distributed file system and read as input by the next job in the workflow. The current practice is to delete these intermediate results from the distributed file system at the end of executing the workflow. One way to improve the performance of workflows of MapReduce jobs is to keep these intermediate results and reuse them for future workflows submitted to the system. In this paper, we present ReStore, a system that manages the storage and reuse of such intermediate results. ReStore can reuse the output of whole MapReduce jobs that are part of a workflow, and it can also create additional reuse opportunities by materializing and storing the output of query execution operators that are executed within a MapReduce job. We have implemented ReStore as an extension to the Pig dataflow system on top of Hadoop, and we experimentally demonstrate significant speedups on queries from the PigMix benchmark.
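
    To make the reuse idea concrete, the sketch below shows, under assumed interfaces (StoredResult, ReuseRepository, operator-prefix matching over HDFS paths), how an incoming job's operator chain might be matched against previously materialized outputs and rewritten to start from the longest matching prefix. It illustrates the general technique only, not ReStore's actual implementation.

```python
# Hedged sketch of intermediate-result reuse: all names and the prefix-matching
# scheme are illustrative assumptions, not ReStore's real interfaces.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class StoredResult:
    input_path: str     # input the original job read
    operators: tuple    # prefix of logical operators already applied
    output_path: str    # where the materialized result lives on the DFS

@dataclass
class ReuseRepository:
    results: list = field(default_factory=list)

    def register(self, result: StoredResult):
        self.results.append(result)

    def rewrite(self, input_path: str, operators: tuple):
        """Find the longest stored operator prefix over the same input and
        rewrite the job to start from that materialized output instead."""
        best = None
        for r in self.results:
            if r.input_path == input_path and operators[:len(r.operators)] == r.operators:
                if best is None or len(r.operators) > len(best.operators):
                    best = r
        if best is None:
            return input_path, operators            # nothing to reuse
        return best.output_path, operators[len(best.operators):]

repo = ReuseRepository()
repo.register(StoredResult("hdfs:///logs", ("LOAD", "FILTER"), "hdfs:///restore/job42"))
print(repo.rewrite("hdfs:///logs", ("LOAD", "FILTER", "GROUP", "COUNT")))
# -> ('hdfs:///restore/job42', ('GROUP', 'COUNT'))
```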

    Reliable scientific service compositions

    Distributed service-oriented architectures (SOAs) are increasingly used by users who are insufficiently skilled in the art of distributed system programming. A good example is that of computational scientists who build large-scale distributed systems using service-oriented Grid computing infrastructures. Computational scientists use these infrastructures to build scientific applications, which are composed from basic Web services into larger orchestrations using workflow languages such as the Business Process Execution Language. For these users, reliability of the infrastructure is of significant importance, and it has to be provided in the presence of hardware or operational failures. The primitives available to achieve such reliability currently leave much to be desired by users who do not necessarily have a strong education in distributed system construction. We characterise scientific service compositions and the environment they operate in by introducing the notion of global scientific BPEL workflows. We outline the threats to the reliability of such workflows and discuss the limited support that available specifications and mechanisms provide to achieve reliability. Furthermore, we propose a line of research to address the identified issues by investigating autonomic mechanisms that assist computational scientists in building, executing and maintaining reliable workflows.
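
    As one illustration of the kind of reliability primitive the abstract alludes to, the sketch below wraps a workflow step's service invocation with retries and failover across replica endpoints. The invoke_service stub and endpoint URLs are hypothetical, and the sketch is not tied to BPEL or to any particular infrastructure.

```python
# Hedged sketch: retry a workflow step against alternate service endpoints.
import time

def invoke_service(endpoint: str, payload: dict) -> dict:
    """Placeholder for a real Web service call (e.g. a SOAP/REST invocation)."""
    raise ConnectionError(f"{endpoint} unreachable")    # simulate a failure

def reliable_invoke(endpoints, payload, retries_per_endpoint=2, backoff=1.0):
    """Try each replica endpoint in turn, retrying with backoff before failing over."""
    last_error = None
    for endpoint in endpoints:
        for attempt in range(retries_per_endpoint):
            try:
                return invoke_service(endpoint, payload)
            except ConnectionError as err:
                last_error = err
                time.sleep(backoff * (attempt + 1))     # simple linear backoff
    raise RuntimeError(f"all endpoints failed: {last_error}")

# Example (hypothetical replica URLs); uncomment to run:
# reliable_invoke(["https://replica-a.example/align", "https://replica-b.example/align"],
#                 {"sequence": "ACGT"})
```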

    On the construction of decentralised service-oriented orchestration systems

    Modern science relies on workflow technology to capture, process, and analyse data obtained from scientific instruments. Scientific workflows are precise descriptions of experiments in which multiple computational tasks are coordinated based on the dataflows between them. These workflows are commonly composed of services that perform computation over geographically distributed resources, and they involve the management of dataflows between those services. Orchestrating scientific workflows presents a significant research challenge: they are typically executed in a manner such that all data pass through a centralised computer server known as the engine, which causes unnecessary network traffic and leads to a performance bottleneck. Centralised orchestration is clearly not a scalable approach for coordinating services dispersed across distant geographical locations. This thesis presents a scalable decentralised service-oriented orchestration system that relies on a high-level data coordination language for the specification and execution of workflows. The system's architecture consists of distributed engines, each of which is responsible for executing part of the overall workflow. It exploits parallelism in the workflow by decomposing it into smaller sub-workflows, and it determines the most appropriate engines to execute them using computation placement analysis. This permits the workflow logic to be distributed closer to the services providing the data for execution, which reduces the overall data transfer in the workflow and improves its execution time. The thesis concludes with an evaluation of the presented system, showing that decentralised orchestration provides scalability benefits over centralised orchestration and improves the overall performance of executing a service-oriented workflow.
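
    The following sketch illustrates the placement idea in simplified form: given an assumed table of transfer costs between engines and services, each sub-workflow is assigned to the engine that minimizes total data transfer from the services feeding it. The engine and service names and the cost model are invented for illustration and do not reflect the thesis's actual computation placement analysis.

```python
# Hedged sketch of engine selection for decentralised orchestration.
# transfer_cost[engine][service] approximates network distance (e.g. latency or hops).
transfer_cost = {
    "engine-eu": {"svc-eu-1": 1, "svc-eu-2": 1, "svc-us-1": 8},
    "engine-us": {"svc-eu-1": 8, "svc-eu-2": 8, "svc-us-1": 1},
}

# Each sub-workflow lists the services whose outputs it consumes.
sub_workflows = {
    "subwf-A": ["svc-eu-1", "svc-eu-2"],
    "subwf-B": ["svc-us-1"],
}

def place(sub_workflows, transfer_cost):
    """Assign every sub-workflow to the engine minimizing its total data transfer."""
    placement = {}
    for name, services in sub_workflows.items():
        placement[name] = min(
            transfer_cost,
            key=lambda engine: sum(transfer_cost[engine][s] for s in services),
        )
    return placement

print(place(sub_workflows, transfer_cost))
# -> {'subwf-A': 'engine-eu', 'subwf-B': 'engine-us'}
```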

    A Meta-Brokering Framework for Science Gateways

    Scientific communities have recently been producing a growing number of computation-intensive applications, which calls for the interoperation of distributed infrastructures, including Clouds, Grids and private clusters. The European SHIWA and ER-flow projects have enabled the combination of heterogeneous scientific workflows and their execution in a large-scale system consisting of multiple Distributed Computing Infrastructures. One of the resource management challenges of these projects is parameter study job scheduling. A parameter study job of a workflow generally has a large number of input files to be consumed by independent job instances. In this paper we propose a meta-brokering framework for science gateways to support the execution of such workflows. In order to cope with the high uncertainty and unpredictable load of the utilized distributed infrastructures, we introduce so-called resource priority services. These tools are capable of determining and dynamically updating the priorities of the available infrastructures to be selected for job instances. Our evaluations show that this approach yields an efficient distribution of job instances among the available computing resources, resulting in a shorter makespan for parameter study workflows.
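
    A minimal sketch of the resource priority idea follows: a priority service keeps a score per infrastructure, updates it from observed completion times, and the broker spreads parameter-study job instances in proportion to those scores. The class name, the smoothing rule and the infrastructure names are assumptions made for illustration, not the framework's actual algorithm.

```python
# Hedged sketch of a resource priority service for a meta-broker.
class ResourcePriorityService:
    def __init__(self, infrastructures):
        # Start with equal priorities for every infrastructure.
        self.priority = {infra: 1.0 for infra in infrastructures}

    def report(self, infra, completion_time):
        """Faster completions raise priority; slow ones lower it (exponential smoothing)."""
        observed = 1.0 / max(completion_time, 1e-9)
        self.priority[infra] = 0.7 * self.priority[infra] + 0.3 * observed

    def dispatch(self, n_instances):
        """Split n job instances across infrastructures proportionally to priority."""
        total = sum(self.priority.values())
        return {infra: round(n_instances * p / total)
                for infra, p in self.priority.items()}

rps = ResourcePriorityService(["grid-A", "cloud-B", "cluster-C"])
rps.report("grid-A", completion_time=120.0)   # slow infrastructure
rps.report("cloud-B", completion_time=30.0)   # faster infrastructure
print(rps.dispatch(100))                      # instance counts per infrastructure
```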

    Data Placement And Task Mapping Optimization For Big Data Workflows In The Cloud

    Data-centric workflows naturally process and analyze huge volumes of datasets. In this new era of Big Data there is a growing need to enable data-centric workflows to perform computations at a scale far exceeding a single workstation's capabilities. Therefore, this type of application can benefit from distributed high performance computing (HPC) infrastructures like cluster, grid or cloud computing. Although data-centric workflows have been applied extensively to structure complex scientific data analysis processes, they fail to address the big data challenges as well as to leverage the capability of dynamic resource provisioning in the Cloud. The concept of “big data workflows” is proposed by our research group as the next generation of data-centric workflow technologies to address the limitations of existing workflow technologies in addressing big data challenges. Executing big data workflows in the Cloud is a challenging problem, as workflow tasks and data are required to be partitioned, distributed and assigned to the cloud execution sites (multiple virtual machines). When running such big data workflows in a cloud distributed across several physical locations, the workflow execution time and the cloud resource utilization efficiency highly depend on the initial placement and distribution of the workflow tasks and datasets across the multiple virtual machines in the Cloud. Several workflow management systems have been developed to help scientists use workflows; however, the data and workflow task placement issue has not been sufficiently addressed yet. In this dissertation, I propose BDAP (Big Data Placement strategy) for data placement and TPS (Task Placement Strategy) for task placement, which improve workflow performance by minimizing data movement across multiple virtual machines in the Cloud during workflow execution. In addition, I propose CATS (Cultural Algorithm Task Scheduling) for workflow scheduling, which improves workflow performance by minimizing workflow execution cost. In this dissertation, I 1) formalize the data and task placement problems in workflows, 2) propose a data placement algorithm that considers both the initial input datasets and the intermediate datasets obtained during workflow runs, 3) propose a task placement algorithm that considers the placement of workflow tasks before the workflow runs, 4) propose a workflow scheduling strategy to minimize the workflow execution cost once a deadline is provided by the user, and 5) perform extensive experiments in a distributed environment to validate that our proposed strategies provide an effective data and task placement solution to distribute and place big datasets and tasks into the appropriate virtual machines in the Cloud within reasonable time.
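
    The sketch below conveys the flavour of the data placement problem with a simple greedy heuristic: each dataset is co-located with the VM hosting most of its consuming tasks, and the remaining cross-VM transfers are tallied. It is an assumed, simplified stand-in, not the BDAP, TPS or CATS algorithms themselves.

```python
# Hedged sketch: greedy dataset-to-VM placement that limits cross-VM movement.
from collections import Counter

# Task -> VM assignment (assumed given, e.g. produced by a task placement step).
task_vm = {"t1": "vm1", "t2": "vm1", "t3": "vm2"}

# Dataset -> (tasks that read it, dataset size in GB).
dataset_consumers = {
    "d1": (["t1", "t2"], 40),
    "d2": (["t2", "t3"], 10),
    "d3": (["t3"], 25),
}

def place_datasets(task_vm, dataset_consumers):
    placement, moved_gb = {}, 0
    for ds, (tasks, size_gb) in dataset_consumers.items():
        # Put the dataset on the VM hosting most of its consumers.
        vm_counts = Counter(task_vm[t] for t in tasks)
        best_vm, local = vm_counts.most_common(1)[0]
        placement[ds] = best_vm
        # Every consumer on another VM implies one remote transfer of the dataset.
        moved_gb += size_gb * (len(tasks) - local)
    return placement, moved_gb

placement, moved = place_datasets(task_vm, dataset_consumers)
print(placement, f"{moved} GB moved across VMs")
# -> {'d1': 'vm1', 'd2': 'vm1', 'd3': 'vm2'} 10 GB moved across VMs
```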

    Using an Actor Framework for Scientific Computing: Opportunities and Challenges

    We examine the challenges and advantages of using an actor framework for the programming and execution of scientific workflows. The following specific topics are studied: implementing workflow semantics and typical workflow patterns in the actor model, parallel and distributed execution of workflow activities using actors, leveraging event sourcing as a novel approach for workflow state persistence and recovery, and applying supervision as a fault tolerance model for workflows. To validate our research in practice, we have created Scaflow, an Akka-based programming library and workflow execution engine. We study an example workflow implemented in Scaflow, and present experimental measurements of workflow persistence overhead.
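
    To illustrate the event-sourcing approach to workflow state persistence mentioned above, the sketch below (in plain Python rather than Scaflow's Akka/Scala) appends every state change to an append-only log and recovers the workflow state by replaying it. The event schema and log path are assumptions made for illustration.

```python
# Hedged sketch of event sourcing for workflow state persistence and recovery.
import json

LOG = "workflow_events.jsonl"   # append-only event log (assumed file path)

def append_event(event: dict, log_path: str = LOG):
    """Persist one state change; state itself is never written directly."""
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

def recover_state(log_path: str = LOG) -> dict:
    """Rebuild workflow state by replaying all persisted events in order."""
    state = {"completed": [], "failed": []}
    try:
        with open(log_path) as f:
            for line in f:
                event = json.loads(line)
                if event["type"] == "TaskCompleted":
                    state["completed"].append(event["task"])
                elif event["type"] == "TaskFailed":
                    state["failed"].append(event["task"])
    except FileNotFoundError:
        pass                    # no events yet: fresh workflow
    return state

append_event({"type": "TaskCompleted", "task": "fetch-data"})
append_event({"type": "TaskFailed", "task": "align-sequences"})
print(recover_state())          # state survives a crash of the in-memory engine
```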

    A Scientific Workflow System For Genomic Data Analysis

    Scientific workflows have become increasingly popular as a new computing paradigm for scientists to design and execute complex and distributed scientific processes, enabling and accelerating many scientific discoveries. Although several scientific workflow management systems (SWFMSs) have been developed, there is a great need for an integrated scientific workflow system that enables the design and execution of higher-level scientific workflows, which integrate heterogeneous scientific workflows enacted by existing SWFMSs. On one hand, science is becoming increasingly collaborative today, requiring an integrated solution that combines the features and capabilities of different SWFMSs, which are typically developed and optimized towards a single discipline. On the other hand, such an integrated environment can immediately leverage existing and emerging techniques and strengths of various SWFMSs and their supported execution environments, such as Cluster, Grid, and Cloud. The main contributions of this dissertation are: 1) we propose a scientific workflow system, called GENOMEFLOW, to design, develop, and execute higher-level scientific workflows whose workflow tasks are themselves scientific workflows enacted by existing SWFMSs; 2) we propose a workflow scheduling algorithm, called GSA, to enable the parallel execution of such heterogeneous scientific workflows in their native heterogeneous environments; and 3) we implemented GENOMEFLOW for the life science community and developed several GENOMEFLOW scientific workflows to demonstrate the capabilities of our system for genomic data analysis applications.
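
    The sketch below gives a simplified picture of a higher-level workflow whose tasks are themselves sub-workflows submitted to different SWFMSs, with dependency-free tasks dispatched in parallel. The engine names, workflow file names and the run_in_swfms stub are hypothetical and do not represent GENOMEFLOW's or GSA's actual interfaces.

```python
# Hedged sketch: run sub-workflows on their native SWFMSs, in parallel where possible.
from concurrent.futures import ThreadPoolExecutor

def run_in_swfms(engine: str, workflow: str) -> str:
    """Placeholder for submitting a sub-workflow to its native SWFMS."""
    return f"{workflow} finished on {engine}"

# Task -> (SWFMS engine, sub-workflow artifact, dependencies); names are illustrative.
tasks = {
    "qc":       ("Galaxy",  "quality-control.ga", []),
    "align":    ("Taverna", "alignment.t2flow",   ["qc"]),
    "annotate": ("Kepler",  "annotation.kar",     ["qc"]),
    "report":   ("Galaxy",  "report.ga",          ["align", "annotate"]),
}

def schedule(tasks):
    done = set()
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            # Every task whose dependencies are complete is ready to run now.
            ready = [t for t, (_, _, deps) in tasks.items()
                     if t not in done and all(d in done for d in deps)]
            futures = {t: pool.submit(run_in_swfms, *tasks[t][:2]) for t in ready}
            for t, fut in futures.items():
                print(fut.result())
                done.add(t)

schedule(tasks)   # "align" and "annotate" run concurrently once "qc" has finished
```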