10 research outputs found

    Big Data Technologies: Additional Features or Replacement for Traditional Data Management Systems?

    With data volumes growing continuously and a multitude of sources producing diverse structures, classic data management tools have become unsuitable for processing and unable to support effective information retrieval and knowledge management. A major challenge has therefore become how to deal with the explosion of data and transform it into new, useful, and interesting knowledge. Despite the rapid development and change in the database world, the diversity of data management systems makes it difficult to choose the best solution to analyze, interpret, and manage data according to the user’s needs while preserving data availability. Hence, the arrival of Big Data in our technological landscape offers new solutions for data processing. In this work, we aim to present a brief overview of the current, much-discussed research field called Big Data. Then, we provide a broad comparison of two data management technologies

    In-memory caching for multi-query optimization of data-intensive scalable computing workloads

    In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work. Instead of optimizing jobs independently, multi-query optimization techniques can be employed to save a considerable amount of cluster resources. In this work, we introduce a novel method combining in-memory cache primitives and multi-query optimization, to improve the efficiency of data-intensive, scalable computing frameworks. By careful selection and exploitation of common (sub)expressions, while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient one which avoids unnecessary recomputations. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the multiple-choice knapsack problem. Experiments on a prototype implementation of our system show significant benefits of worksharing for TPC-DS workloads

    Optimization Techniques for Complex Multi-query Applications


    Studying the effect of multi-query functionality on a correlation-aware SQL-to-mapreduce translator in Hadoop version 2

    The advent of big data has prompted both the industry and research for numerous solutions in catering to the need for data with high volume, veracity, velocity and variety properties. The notion of ever increasing data was initially publicized in 1944 by Fremont Rider, who argued that the libraries in American Universities are doubling in size every sixteen years (Press, 2013). Then, when the digital storage era came to be, it became easier than ever to store and manage large volumes of data. The need for efficient big data systems is now further fueled by the Internet of Things as it opens floodgates for never-before-seen new information flow. These phenomena have called for a simpler and more scalable environment with high fault tolerance and control over availability. With that motivation in mind, and as an alternative to relational databases, numerous Not-Only Structured Query Language (NoSQL) databases were conceived. Nonetheless, relational databases and their de facto language, Structured Query Language (SQL), are still prominent among wider user groups. This thesis project ventures into bridging the gap between Hadoop and relational databases through allowing multi-query functionality to a SQL-to-MapReduce translator. In addition to that, this research also includes the upgrade of the translator to a newer Hadoop version to utilize newer tools and features added since its original deployment. This study also includes the analysis of the modified translator's behavior under different sets of conditions. A regression model was devised for each of the experiments made and presented as significant means of understanding the data collected and any future estimates

    Intermediate Results Materialization Selection and Format for Data-Intensive Flows

    Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results, shared among multiple flows, brings benefits not only in terms of performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, which are studied under the materialized view selection problem. With the rise of Big Data systems, new challenges emerge due to new quality metrics captured by service level agreements which must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for automatic selection of multi-objective materialization of intermediate results in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for selected materialized intermediate results based on subsequent access patterns. The experimental results show that our approach provides 40% better average speedup with respect to the current state-of-the-art, as well as an improvement on disk access time of 18% as compared to fixed format solutions

    Cache-Based Multi-Query Optimization for Data-Intensive Scalable Computing Frameworks

    In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in redundant and wasteful processing, multi-query optimization techniques can be employed to save a considerable amount of cluster resources. In this work, we introduce a novel method combining in-memory cache primitives and multi-query optimization, to improve the efficiency of data-intensive, scalable computing frameworks. By careful selection and exploitation of common (sub)expressions, while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient one which avoids unnecessary recomputations. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the multiple-choice knapsack problem. Extensive experiments on a prototype implementation of our system show significant benefits of worksharing for both TPC-DS workloads and detailed micro-benchmarks
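The abstract above likens cache selection to the multiple-choice knapsack problem. As a rough illustration only, and not the authors' actual formulation, the sketch below assumes each shared (sub)expression offers a few hypothetical caching options, each with an integer memory cost and an estimated time saving, and picks at most one option per (sub)expression within a memory budget. The name `select_cache_plan`, the option tuples, and the units are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical option: (memory_cost, estimated_time_saved); units are illustrative.
Option = Tuple[int, float]


def select_cache_plan(groups: List[List[Option]], budget: int) -> Tuple[float, List[int]]:
    """Multiple-choice knapsack by dynamic programming.

    Each group is one shared (sub)expression; at most one of its options
    may be chosen. Returns (best_total_saving, chosen), where chosen[g] is
    the selected option index for group g, or -1 if nothing is cached.
    """
    # best[m] = largest saving achievable with total memory <= m
    best = [0.0] * (budget + 1)
    # choice[g][m] = option picked for group g when m memory is available
    choice: List[List[int]] = []

    for options in groups:
        new_best = best[:]              # default: skip this group
        picked = [-1] * (budget + 1)
        for idx, (cost, saving) in enumerate(options):
            for m in range(cost, budget + 1):
                candidate = best[m - cost] + saving
                if candidate > new_best[m]:
                    new_best[m] = candidate
                    picked[m] = idx
        best = new_best
        choice.append(picked)

    # Walk the tables backwards to recover which options were taken.
    chosen = [-1] * len(groups)
    m = budget
    for g in range(len(groups) - 1, -1, -1):
        idx = choice[g][m]
        chosen[g] = idx
        if idx >= 0:
            m -= groups[g][idx][0]
    return best[budget], chosen
```

For example, `select_cache_plan([[(2, 3.0), (4, 5.0)], [(3, 4.0)], [(1, 1.0), (5, 9.0)]], 6)` considers three (sub)expressions under a budget of 6 memory units. The dynamic program runs in time proportional to the budget times the total number of options, which is only practical when memory is expressed in coarse units.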

    Materialisierte views in verteilten key-value stores

    Distributed key-value stores have become the solution of choice for warehousing large volumes of data. However, their architecture is not suitable for real-time analytics. To achieve the required velocity, materialized views can be used to provide summarized data for fast access. The main challenge, then, is the incremental, consistent maintenance of views at large scale. Thus, we introduce our View Maintenance System (VMS) to maintain SQL queries in a data-intensive real-time scenario. Distributed key-value stores are a type of modern database for processing large volumes of data. Nevertheless, their architecture does not permit analytical queries in real time. Materialized views can compensate for this drawback by enabling fast access to results. The challenge is then the incremental and consistent updating of the views. We therefore present our View Maintenance System (VMS), which computes data-intensive SQL queries in real time
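To make the idea of incremental view maintenance concrete, here is a minimal single-process sketch, assuming a view equivalent to `SELECT group, SUM(value), COUNT(*) ... GROUP BY group` over a key-value base table. It is not the VMS described in the abstract and ignores distribution and consistency entirely; the class `SumCountView` and its methods are hypothetical names introduced for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SumCountView:
    """Keeps a SUM/COUNT-per-group view in step with base-table updates."""

    def __init__(self) -> None:
        # Base table: row_key -> (group, value)
        self.base: Dict[str, Tuple[str, int]] = {}
        # View: group -> [sum, count]
        self.view: Dict[str, List[int]] = defaultdict(lambda: [0, 0])

    def _apply(self, group: str, value: int, sign: int) -> None:
        # Apply a delta (+1 for insert, -1 for removal) to a single group.
        agg = self.view[group]
        agg[0] += sign * value
        agg[1] += sign
        if agg[1] == 0:
            del self.view[group]  # drop groups that no longer have rows

    def put(self, row_key: str, group: str, value: int) -> None:
        # An overwrite is handled as removing the old row, then adding the new one.
        old = self.base.get(row_key)
        if old is not None:
            self._apply(old[0], old[1], -1)
        self.base[row_key] = (group, value)
        self._apply(group, value, +1)

    def delete(self, row_key: str) -> None:
        old = self.base.pop(row_key, None)
        if old is not None:
            self._apply(old[0], old[1], -1)
```

Each `put` or `delete` touches only the affected groups, so the view is never recomputed from the full base table. Doing the same consistently across many nodes with concurrent updates is the hard part that the abstract refers to.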

    Multi-Tenant Geo-Distributed Data Analytics

    University of Minnesota Ph.D. dissertation. July 2019. Major: Computer Science. Advisors: Abhishek Chandra, Jon Weissman. 1 computer file (PDF); x, 132 pages. Geo-distributed data analytics has gained much interest in recent years due to the need for extracting insights from geo-distributed data. Traditionally, data analytics has been done within a cluster/data center environment. However, analyzing geo-distributed data using existing cluster-based systems typically cannot satisfy the timeliness requirement of most applications and results in wasteful resource consumption due to the fundamental differences of the environments, especially due to the scarce, highly heterogeneous, and dynamic nature of the wide-area resources: compute power and network bandwidth. This thesis addresses the challenges faced by geo-distributed data analytics systems in ensuring high-performance and reliable execution of multiple data analytics applications/queries. Specifically, the focus is on sharing resources across multiple users, applications, and computing frameworks. Sharing resources is attractive as it increases resource utilization and reduces operational cost. However, ensuring high-performance execution of multiple applications in a shared environment is challenging as they may compete for the same resources, especially in a wide-area environment with scarce resources. Furthermore, dynamics such as workload variation, resource variation, stragglers, and failures are inevitable in large-scale distributed systems. These can cause large resource perturbations that significantly affect the performance of query executions. This thesis makes the following contributions. First, we present a resource sharing technique across multiple geo-distributed data analytics frameworks. The main challenge here is how to elastically partition resources while allowing high locality scheduling to each individual framework, which is critical to the execution performance of geo-distributed analytics queries. We then address the problem of how to identify and exploit common executions across multiple queries to mitigate wasteful resource consumption. We demonstrate that traditional multi-query optimization may degrade the overall query execution performance due to its lack of support for network awareness. Finally, we highlight the importance of adaptability in ensuring reliable query execution in the presence of dynamics, both for single and multiple query executions. We propose a systematic approach that can selectively determine which queries to adapt and how to adapt them based on the types of queries, dynamics, and optimization goals