
    ReStore: Reusing Results of MapReduce Jobs

    Analyzing large-scale data has emerged as an important activity for many organizations in the past few years. This large-scale data analysis is facilitated by the MapReduce programming and execution model and its implementations, most notably Hadoop. Users of MapReduce often have analysis tasks that are too complex to express as individual MapReduce jobs. Instead, they use high-level query languages such as Pig, Hive, or Jaql to express their complex tasks. The compilers of these languages translate queries into workflows of MapReduce jobs. Each job in such a workflow reads its input from the distributed file system used by the MapReduce system and produces output that is stored in this distributed file system and read as input by the next job in the workflow. The current practice is to delete these intermediate results from the distributed file system once the workflow finishes executing. One way to improve the performance of workflows of MapReduce jobs is to keep these intermediate results and reuse them for future workflows submitted to the system. In this paper, we present ReStore, a system that manages the storage and reuse of such intermediate results. ReStore can reuse the output of whole MapReduce jobs that are part of a workflow, and it can also create additional reuse opportunities by materializing and storing the output of query execution operators that are executed within a MapReduce job. We have implemented ReStore as an extension to the Pig dataflow system on top of Hadoop, and we experimentally demonstrate significant speedups on queries from the PigMix benchmark.
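The reuse mechanism the abstract describes lends itself to a small illustration. The sketch below is not ReStore's implementation; it is a minimal Python model of the core idea, in which a repository keyed by a job fingerprint (input files plus logical plan) decides whether a job can be replaced by a previously stored result. All names here (`ReuseRepository`, `run_mapreduce`, the dictionary-based job representation) are hypothetical stand-ins.

```python
import hashlib

def run_mapreduce(job):
    """Hypothetical executor: a real system would submit the job to Hadoop."""
    job["output"] = f"/dfs/tmp/{job['name']}.out"

class ReuseRepository:
    """In-memory stand-in for a catalog mapping a canonical job signature
    to the DFS path of that job's stored output."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def signature(input_paths, plan):
        # Fingerprint a job by its input files plus its logical operator plan.
        # A real matcher must also verify the inputs are unchanged since storage.
        key = "|".join(sorted(input_paths)) + "::" + plan
        return hashlib.sha256(key.encode()).hexdigest()

    def lookup(self, sig):
        return self._store.get(sig)

    def register(self, sig, output_path):
        self._store[sig] = output_path  # keep the result instead of deleting it

def execute_workflow(jobs, repo):
    """Run a workflow's jobs in order, reusing stored outputs where possible."""
    for job in jobs:
        sig = ReuseRepository.signature(job["inputs"], job["plan"])
        cached = repo.lookup(sig)
        if cached is not None:
            job["output"] = cached  # rewrite the job to read the stored result
        else:
            run_mapreduce(job)
            repo.register(sig, job["output"])  # materialize for future workflows
```

Submitting the same workflow twice against one repository would run every job the first time and none the second. Real matching is harder than hashing a plan string: recognizing that a stored result corresponds to part of a new job is where the operator-level materialization mentioned in the abstract creates additional reuse opportunities.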

    Automatic Physical Design for XML Databases

    Database systems employ physical structures such as indexes and materialized views to improve query performance, potentially by orders of magnitude. It is therefore important for a database administrator to choose the appropriate configuration of these physical structures (i.e., the appropriate physical design) for a given database. Deciding on the physical design of a database is not an easy task, and a considerable amount of research exists on automatic physical design tools for relational databases. XML database systems are increasingly being used to manage highly structured XML data, and support for XML data is being added to commercial relational database systems. This raises the important question of how to choose the appropriate physical design (i.e., the appropriate set of physical structures) for an XML database. Relational automatic physical design tools are not adequate, so new research is needed in this area. In this thesis, we address the problem of automatic physical design for XML databases: the process of automatically selecting the best set of physical structures for a given database and a query workload representing the client application's usage patterns of the data. We focus on recommending two types of physical structures: XML indexes and relational materialized views of XML data. For each of these structures, we study the recommendation process and present a design advisor that automatically recommends a configuration of physical structures given an XML database and a workload of XML queries. The recommendation process is divided into four main phases: (1) enumerating candidate physical structures, (2) generalizing candidate structures to generate more candidates that are useful for queries that do not appear in the given workload but are similar to its queries, (3) estimating the benefit of the various candidate structures, and (4) selecting the best set of candidate structures for the given database and workload. We present a design advisor for recommending XML indexes, one for recommending materialized views, and an integrated design advisor that recommends both indexes and materialized views. A key characteristic of our advisors is that they are tightly coupled with the query optimizer of the database system and rely on the optimizer for enumerating and evaluating physical designs whenever possible. This characteristic makes our techniques suitable for any database system that complies with a set of minimum requirements listed in the thesis. We have implemented the index, materialized view, and integrated advisors in a prototype version of IBM DB2 V9, which supports both relational and XML data, and we experimentally demonstrate the effectiveness of their recommendations using this implementation.
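To make the four-phase process concrete, here is a rough Python sketch of the final selection phase only. It assumes phases 1-3 have already produced candidates annotated with optimizer-estimated sizes and benefits, and it uses a standard greedy benefit-per-unit-of-storage heuristic under a space budget; the thesis's actual selection algorithm and cost model may differ, and `Candidate` and `select_configuration` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str       # e.g. an XML index on a path expression, or a view definition
    size: float     # estimated storage cost (MB)
    benefit: float  # estimated workload cost reduction, e.g. from optimizer what-if calls

def select_configuration(candidates, budget_mb):
    """Greedy phase-4 selection: take candidates in decreasing order of benefit
    per unit of storage until the space budget is exhausted. Interactions
    between structures (one index making another redundant) are ignored here."""
    chosen, used = [], 0.0
    ranked = sorted(candidates, key=lambda c: c.benefit / c.size, reverse=True)
    for c in ranked:
        if c.benefit > 0 and used + c.size <= budget_mb:
            chosen.append(c)
            used += c.size
    return chosen

# Toy workload: two index candidates and one materialized view.
picked = select_configuration(
    [Candidate("idx:/book/author", 120, 40.0),
     Candidate("idx://price", 300, 15.0),
     Candidate("view:books_by_year", 500, 90.0)],
    budget_mb=600,
)
print([c.name for c in picked])  # ['idx:/book/author', 'idx://price']
```

Note how the 500 MB view is skipped despite its large absolute benefit: once the cheap index is chosen, the view no longer fits the budget. Handling such trade-offs well, together with the optimizer coupling the abstract emphasizes, is what separates a real advisor from this toy.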

    A Machine Learning Approach for Predicting Execution Time of Spark Jobs

    Spark has gained growing attention in the past couple of years as an in-memory cloud computing platform. It supports the execution of various types of workloads, such as SQL queries and machine learning applications. Many enterprises currently use Spark to exploit its fast in-memory processing of large-scale data. Speeding up execution in Spark is an important problem for many real-time applications. It can be achieved by improving the scheduling approaches employed by Spark, optimizing the execution plans Spark generates for various applications, and selecting the best cluster configuration on which to run an input workload. A first step for all of these optimization approaches is to predict the execution time of an input Spark application. In this paper, we present a new platform that predicts with high accuracy the execution time of SQL queries and machine learning applications executed by Spark. We evaluate the proposed platform by measuring the accuracy of its execution time predictions for various types of Spark jobs, including TPC-H queries and machine learning classification/clustering applications. The evaluation experiments show that our platform predicts the execution time of Spark jobs with accuracy greater than 90% for SQL queries and greater than 75% for machine learning jobs. Keywords: Spark, Execution Time Prediction, Machine Learning
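The abstract does not name the model, so the sketch below is only a generic stand-in for the prediction task: fit a regressor on per-job features and report 1 - MAPE as an accuracy proxy. The feature set, the synthetic training data, and the choice of scikit-learn's gradient-boosted trees are illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical features per Spark job: input size (GB), number of stages,
# total tasks, executor count, executor memory (GB). Labels are runtimes (s).
rng = np.random.default_rng(0)
X = rng.uniform([1, 2, 10, 2, 4], [500, 40, 5000, 64, 32], size=(1000, 5))
# Synthetic runtime: grows with data volume and per-executor work, plus noise.
y = 5 + 0.8 * X[:, 0] + 0.02 * X[:, 2] / X[:, 3] * X[:, 1] + rng.normal(0, 2, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)

mape = mean_absolute_percentage_error(y_test, model.predict(X_test))
print(f"prediction accuracy ~ {1 - mape:.1%}")  # 1 - MAPE as an accuracy proxy
```

In a real pipeline the features would come from Spark's logical plan and cluster configuration rather than a random generator, and separate models (or feature sets) for SQL and machine learning jobs would be a natural way to address the accuracy gap the abstract reports between the two workload types.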