3 research outputs found

    Adaptive query execution for data management in the cloud

    Get PDF
    A major component of many cloud services is query processing on data stored in the underlying cloud cluster. The traditional techniques for query processing on a cluster are those offered by parallel DBMS. These techniques however, cannot guarantee high performance for cloud; parallel DBMS lack adequate fault tolerance mechanisms in order to deal with non-negligible software and hardware failures. MapReduce, on the other hand, allows query processing solutions that are fault tolerant, but imposes substantial overheads. In this paper, we propose an adaptive software architecture which can effortlessly switch between MapReduce and parallel DBMS in order to efficiently process queries regardless of their response times. Switching between the two architectures is performed in a transparent manner based on an intuitive cost model, which computes the expected execution time in presence of failures. The experimental results show that the adaptive architecture achieves the lowest possible query execution time for various scenarios

    Schema Migration from Relational Databases to NoSQL Databases with Graph Transformation and Selective Denormalization

    Get PDF
    We witnessed a dramatic increase in the volume, variety and velocity of data leading to the era of big data. The structure of data has become highly flexible leading to the development of many storage systems that are different from the traditional structured relational databases where data is stored in “tables,” with columns representing the lowest granularity of data. Although relational databases are still predominant in the industry, there has been a major drift towards alternative database systems that support unstructured data with better scalability leading to the popularity of “Not Only SQL.” Migration from relational databases to NoSQL databases has become a significant area of interest when it involves enormous volumes of data with a large number of concurrent users. Many migration methodologies have been proposed each focusing a specific NoSQL family. This paper proposes a heuristics based graph transformation method to migrate a relational database to MongoDB called Graph Transformation with Selective Denormalization and compares the migration with a table level denormalization method. Although this paper focuses on MongoDB, the heuristics algorithm is generalized enough to be applied to other NoSQL families. Experimental evaluation with TPC-H shows that Graph Transformation with Selective Denormalization migration method has lower query execution times with lesser hardware footprint like lower space requirement, disk I/O, CPU utilization compared to that of table level denormalization

    Runtime Prediction for Scale-Out Data Analytics

    Get PDF
    Many analytics applications generate mixed workloads, i.e., workloads comprised of analytical tasks with different processing characteristics including data pre-processing, SQL, and iterative machine learning algorithms. Examples of such mixed workloads can be found in web data analysis, social media analysis, and graph analytics, where they are executed repetitively on large input datasets (e.g., "Find the average user time spent on the top 10 most popular web pages on the UK domain web graph."). Scale-out processing engines satisfy the needs of these applications by distributing the data and the processing task efficiently among multiple workers that are first reserved and then used to execute the task in parallel on a cluster of machines. Finding the resource allocation that can complete the workload execution within a given time constraint, and optimizing cluster resource allocations among multiple analytical workloads motivates the need for estimating the runtime of the workload before its actual execution. Predicting runtime of analytical workloads is a challenging problem as runtime depends on a large number of factors that are hard to model a priori execution. These factors can be summarized as workload characteristics (i.e., data statistics and processing costs), the execution configuration (i.e., deployment, resource allocation, and software settings), and the cost model that captures the interplay among all of the above parameters. While conventional cost models proposed in the context of query optimization can assess the relative order among alternative SQL query plans, they are not aimed to estimate absolute runtime. Additionally, conventional models are ill-equipped to estimate the runtime of iterative analytics that are executed repetitively until convergence and that of user defined data pre-processing operators which are not "owned" by the underlying data management system. This thesis demonstrates that runtime for data analytics can be predicted accurately by breaking the analytical tasks into multiple processing phases, collecting key input features during a reference execution on a sample of the dataset, and then using the features to build per-phase cost models. We develop prediction models for three categories of data analytics produced by social media applications: iterative machine learning, data pre-processing, and reporting SQL. The prediction framework for iterative analytics, PREDIcT, addresses the challenging problem of estimating the number of iterations, and per-iteration runtime for a class of iterative machine learning algorithms that are run repetitively until convergence. The hybrid prediction models we develop for data pre-processing tasks and for reporting SQL combine the benefits of analytical modeling with that of machine learning-based models. Through a training methodology and a pruning algorithm we reduce the cost of running training queries to a minimum while maintaining a good level of accuracy for the models
    corecore