
    A unified view of data-intensive flows in business intelligence systems : a survey

    Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next-generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time, operational data flows that integrate source data at runtime. Both academia and industry thus need a clear understanding of the foundations of data-intensive flows and of the challenges of moving towards next-generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next-generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that are still to be addressed and showing how current solutions can be applied to address them.

    Reducing the View Selection Problem through Code Modeling: Static and Dynamic approaches

    Data warehouse systems aim to support decision making by providing users with the appropriate information at the right time. This task is particularly challenging in business contexts where large amounts of data are produced at high speed. To this end, data warehouses have been equipped with Online Analytical Processing (OLAP) tools that help users make fast and precise decisions through the execution of complex queries. Since the computation of these queries is time consuming, data warehouses precompute a set of materialized views answering the workload queries. This thesis defines a process to determine the minimal set of workload queries and the set of views to materialize. The set of queries is represented by an optimized lattice structure used to select the views to be materialized according to the processing time costs and the view storage space. The minimal set of required OLAP queries is computed by analyzing the data model defined with the visual language CoDe (Complexity Design). The latter allows the visualization of data reports to be organized conceptually and generates visualizations of the data obtained from data-mart queries. CoDe adopts a hybrid modeling process combining two main methodologies: user-driven and data-driven. The former aims to create a model according to the user's knowledge, requirements, and analysis needs, whilst the latter is in charge of concretizing the data and their relationships in the model through OLAP queries. Since the materialized views change over time, we also propose a dynamic process that allows users to upgrade the CoDe model with a context-aware editor, builds an optimized lattice structure that minimizes the effort needed to recalculate it, and proposes the new set of views to materialize. Moreover, the process applies a Markov strategy to predict whether the views need to be recalculated or not according to the changes of the model. The effectiveness of the proposed techniques has been evaluated on a real-world data warehouse. The results revealed that the Markov strategy gives a better set of solutions in terms of storage space and total processing cost. [edited by author]
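
    To make the lattice-based selection step concrete, the sketch below (Java, with illustrative names and a plain row-count cost model, not the thesis's CoDe-driven process or its Markov prediction) shows the classic greedy, benefit-per-view heuristic for choosing which nodes of an aggregation lattice to materialize.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch: greedy selection of materialized views over a cube lattice,
    // using estimated row counts as both view size and query-answering cost.
    // Assumes every view lists the base cuboid among its answerableBy names.
    public class GreedyViewSelection {

        static class View {
            final String name;
            final long rows;                 // estimated size of the view
            final Set<String> answerableBy;  // views (itself included) that can answer queries on this node

            View(String name, long rows) {
                this.name = name;
                this.rows = rows;
                this.answerableBy = new HashSet<>();
            }
        }

        // Cost of answering queries on v given the currently materialized views.
        static long cheapestCost(View v, List<View> materialized) {
            long best = Long.MAX_VALUE;
            for (View m : materialized) {
                if (v.answerableBy.contains(m.name)) {
                    best = Math.min(best, m.rows);
                }
            }
            return best;
        }

        // Greedily pick up to k views that maximize the total reduction in query cost.
        static List<View> select(List<View> lattice, View baseCuboid, int k) {
            List<View> materialized = new ArrayList<>();
            materialized.add(baseCuboid);            // the base cuboid is always available
            List<View> chosen = new ArrayList<>();
            for (int i = 0; i < k; i++) {
                View best = null;
                long bestBenefit = 0;
                for (View candidate : lattice) {
                    if (materialized.contains(candidate)) continue;
                    long benefit = 0;
                    for (View q : lattice) {
                        if (!q.answerableBy.contains(candidate.name)) continue;
                        long current = cheapestCost(q, materialized);
                        if (candidate.rows < current) benefit += current - candidate.rows;
                    }
                    if (benefit > bestBenefit) {
                        bestBenefit = benefit;
                        best = candidate;
                    }
                }
                if (best == null) break;             // no remaining view reduces the cost
                materialized.add(best);
                chosen.add(best);
            }
            return chosen;
        }
    }

    In the approach summarized above, the candidate lattice is derived from the OLAP queries extracted from the CoDe model, and both processing cost and view storage space constrain the final choice.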

    Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

    Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization.
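
    As a flavor of how an embedding system can drive Calcite, the following sketch uses Calcite's RelBuilder to construct and print a scan-filter-project plan over a small in-memory schema exposed through ReflectiveSchema; the Hr and Emp classes and their data are assumptions made for this example, not part of the paper.

    import org.apache.calcite.adapter.java.ReflectiveSchema;
    import org.apache.calcite.plan.RelOptUtil;
    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.schema.SchemaPlus;
    import org.apache.calcite.tools.FrameworkConfig;
    import org.apache.calcite.tools.Frameworks;
    import org.apache.calcite.tools.RelBuilder;

    // Illustrative use of Calcite's RelBuilder: build and print a relational plan
    // (scan -> filter -> project) over an in-memory schema.
    public class CalciteRelBuilderDemo {

        // Hypothetical in-memory data exposed to Calcite via ReflectiveSchema.
        public static class Emp {
            public final int empno;
            public final String ename;
            public final int deptno;
            public Emp(int empno, String ename, int deptno) {
                this.empno = empno;
                this.ename = ename;
                this.deptno = deptno;
            }
        }

        public static class Hr {
            public final Emp[] emps = {
                new Emp(100, "Alice", 10),
                new Emp(200, "Bob", 20)
            };
        }

        public static void main(String[] args) {
            SchemaPlus rootSchema = Frameworks.createRootSchema(true);
            rootSchema.add("hr", new ReflectiveSchema(new Hr()));
            FrameworkConfig config = Frameworks.newConfigBuilder()
                .defaultSchema(rootSchema.getSubSchema("hr"))
                .build();
            RelBuilder builder = RelBuilder.create(config);
            RelNode plan = builder
                .scan("emps")
                .filter(builder.equals(builder.field("deptno"), builder.literal(10)))
                .project(builder.field("empno"), builder.field("ename"))
                .build();
            System.out.println(RelOptUtil.toString(plan));  // prints the logical plan tree
        }
    }

    From such a RelNode tree, Calcite's optimizer can apply its rewrite rules before the plan is executed by the host system or pushed down through an adapter.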

    Data Warehouse Design and Management: Theory and Practice

    The need to store data and information permanently, for reuse in later stages, is a very relevant problem in the modern world and now affects a large number of people and economic agents. The storage and subsequent use of data can indeed be a valuable source for decision making or for increasing commercial activity. The next step beyond data storage is the efficient and effective use of information, particularly through Business Intelligence, at whose base lies the implementation of a Data Warehouse. In the present paper we analyze Data Warehouses and their theoretical models, and illustrate a practical implementation in a specific case study on a pharmaceutical distribution company. Keywords: data warehouse, database, data model.

    Exploiting Data Skew for Improved Query Performance

    Analytic queries enable sophisticated large-scale data analysis within many commercial, scientific and medical domains today. Data skew is a ubiquitous feature of these real-world domains. In a retail database, some products are typically much more popular than others. In a text database, word frequencies follow a Zipf distribution with a small number of very common words, and a long tail of infrequent words. In a geographic database, some regions have much higher populations (and data measurements) than others. Current systems do not make the most of caches for exploiting skew. In particular, a whole cache line may remain cache resident even though only a small part of the cache line corresponds to a popular data item. In this paper, we propose a novel index structure for repositioning data items to concentrate popular items into the same cache lines. The net result is better spatial locality, and better utilization of limited cache resources. We develop a theoretical model for analyzing the cache behavior, and implement database operators that are efficient in the presence of skew. Our experiments on real and synthetic data show that exploiting skew can significantly improve in-memory query performance. In some cases, our techniques can speed up queries by over an order of magnitude.
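
    To make the locality argument concrete, here is a minimal, hypothetical Java sketch (not the paper's index structure) that reorders an array of keys by observed access frequency so that popular items end up adjacent and are therefore more likely to share cache lines.

    import java.util.Arrays;
    import java.util.Map;

    // Illustrative sketch: concentrate frequently accessed ("hot") keys into
    // contiguous slots so that a fetched cache line contains several popular items.
    public class SkewAwareLayout {

        static int[] reorderByFrequency(int[] keys, Map<Integer, Long> accessCounts) {
            Integer[] boxed = Arrays.stream(keys).boxed().toArray(Integer[]::new);
            // Most frequently accessed keys first: they end up packed together.
            Arrays.sort(boxed, (a, b) ->
                Long.compare(accessCounts.getOrDefault(b, 0L), accessCounts.getOrDefault(a, 0L)));
            int[] out = new int[keys.length];
            for (int i = 0; i < out.length; i++) {
                out[i] = boxed[i];
            }
            return out;
        }

        public static void main(String[] args) {
            int[] keys = {5, 9, 1, 7, 3};
            Map<Integer, Long> counts = Map.of(5, 2L, 9, 120L, 1, 7L, 7, 500L, 3, 1L);
            System.out.println(Arrays.toString(reorderByFrequency(keys, counts)));
            // -> [7, 9, 1, 5, 3]: the hot keys 7 and 9 now sit in adjacent slots.
        }
    }

    A real implementation would operate at cache-line granularity (e.g., 64-byte groups) and maintain a mapping back to the original positions, but the underlying idea of concentrating hot items is the same.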