119 research outputs found

    CubiST++: Evaluating Ad-Hoc CUBE Queries Using Statistics Trees

    Get PDF
    We report on a new, efficient encoding for the data cube, which results in a drastic speed-up of OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes. We are focusing on a class of queries called cube queries, which return aggregated values rather than sets of tuples. Our approach, termed CubiST++ (Cubing with Statistics Trees Plus Families), represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the view lattice to compute and materialize new views from existing views in some heuristic fashion. Instead, CubiST++ encodes all possible aggregate views in the leaves of a new data structure called statistics tree (ST) during a one-time scan of the detailed data. In order to optimize the queries involving constraints on hierarchy levels of the underlying dimensions, we select and materialize a family of candidate trees, which represent superviews over the different hierarchical levels of the dimensions. Given a query, our query evaluation algorithm selects the smallest tree in the family, which can provide the answer. Extensive evaluations of our prototype implementation have demonstrated its superior run-time performance and scalability when compared with existing MOLAP and ROLAP systems

    Multidimensional process discovery

    Get PDF

    Forecasting in Database Systems

    Get PDF
    Time series forecasting is a fundamental prerequisite for decision-making processes and crucial in a number of domains such as production planning and energy load balancing. In the past, forecasting was often performed by statistical experts in dedicated software environments outside of current database systems. However, forecasts are increasingly required by non-expert users or have to be computed fully automatically without any human intervention. Furthermore, we can observe an ever increasing data volume and the need for accurate and timely forecasts over large multi-dimensional data sets. As most data subject to analysis is stored in database management systems, a rising trend addresses the integration of forecasting inside a DBMS. Yet, many existing approaches follow a black-box style and try to keep changes to the database system as minimal as possible. While such approaches are more general and easier to realize, they miss significant opportunities for improved performance and usability. In this thesis, we introduce a novel approach that seamlessly integrates time series forecasting into a traditional database management system. In contrast to flash-back queries that allow a view on the data in the past, we have developed a Flash-Forward Database System (F2DB) that provides a view on the data in the future. It supports a new query type - a forecast query - that enables forecasting of time series data and is automatically and transparently processed by the core engine of an existing DBMS. We discuss necessary extensions to the parser, optimizer, and executor of a traditional DBMS. We furthermore introduce various optimization techniques for three different types of forecast queries: ad-hoc queries, recurring queries, and continuous queries. First, we ease the expensive model creation step of ad-hoc forecast queries by reducing the amount of processed data with traditional sampling techniques. Second, we decrease the runtime of recurring forecast queries by materializing models in a specialized index structure. However, a large number of time series as well as high model creation and maintenance costs require a careful selection of such models. Therefore, we propose a model configuration advisor that determines a set of forecast models for a given query workload and multi-dimensional data set. Finally, we extend forecast queries with continuous aspects allowing an application to register a query once at our system. As new time series values arrive, we send notifications to the application based on predefined time and accuracy constraints. All of our optimization approaches intend to increase the efficiency of forecast queries while ensuring high forecast accuracy

    An architecture for recycling intermediates in a column-store

    Get PDF
    Automatically recycling (intermediate) results is a grand challenge for state-of-the-art databases to improve both query response time and throughput. Tuples are loaded and streamed through a tuple-at-a-time processing pipeline avoiding materialization of intermediates as much as possible. This limits the opportunities for reuse of overlapping computations to DBA-defined materialized views and function/result cache tuning. In contrast, the operator-at-a-time execution paradigm produces fully materialized results in each step of the query plan. To avoid resource contention, these intermediates are evicted as soon as possible. In this paper we study an architecture that harvests the by-products of the operator-at-a-time paradigm in a column store system using a lightweight mechanism, the recycler. The key challenge then becomes selection of the policies to admit intermediates to the resource pool, their retention period, and the eviction strategy when facing resource limitations. The proposed recycling architecture has been implemented in an open-source system. An experimental analysis against the TPC-H ad-hoc decision support benchmark and a complex, real-world application (SkyServer) demonstrates its effectiveness in terms of self-organizing behavior and its significant performance gains. The results indicate the potentials of recycling intermediates and charters a route for further development of database kernels

    Maintenance-cost view-selection in large data warehouse systems: algorithms, implementations and evaluations.

    Get PDF
    Choi Chi Hon.Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.Includes bibliographical references (leaves 120-126).Abstracts in English and Chinese.Abstract --- p.iAbstract (Chinese) --- p.iiAcknowledgement --- p.iiiContents --- p.ivList of Figures --- p.viiiList of Tables --- p.xChapter 1 --- Introduction --- p.1Chapter 1.1 --- Maintenance Cost View Selection Problem --- p.2Chapter 1.2 --- Previous Research Works --- p.3Chapter 1.3 --- Major Contributions --- p.4Chapter 1.4 --- Thesis Organization --- p.6Chapter 2 --- Literature Review --- p.7Chapter 2.1 --- Data Warehouse and OLAP Systems --- p.8Chapter 2.1.1 --- What Is Data Warehouse? --- p.8Chapter 2.1.2 --- What Is OLAP? --- p.10Chapter 2.1.3 --- Difference Between Operational Database Systems and OLAP --- p.10Chapter 2.1.4 --- Data Warehouse Architecture --- p.12Chapter 2.1.5 --- Multidimensional Data Model --- p.13Chapter 2.1.6 --- Star Schema and Snowflake Schema --- p.15Chapter 2.1.7 --- Data Cube --- p.17Chapter 2.1.8 --- ROLAP and MOLAP --- p.19Chapter 2.1.9 --- Query Optimization --- p.20Chapter 2.2 --- Materialized View --- p.22Chapter 2.2.1 --- What Is A Materialized View --- p.23Chapter 2.2.2 --- The Role of Materialized View in OLAP --- p.23Chapter 2.2.3 --- The Challenges in Exploiting Materialized View --- p.24Chapter 2.2.4 --- What Is View Maintenance --- p.25Chapter 2.3 --- View Selection --- p.27Chapter 2.3.1 --- Selection Strategy --- p.27Chapter 2.4 --- Summary --- p.32Chapter 3 --- Problem Definition --- p.33Chapter 3.1 --- View Selection Under Constraint --- p.33Chapter 3.2 --- The Lattice Framework for Maintenance Cost View Selection Prob- lem --- p.35Chapter 3.3 --- The Difficulties of Maintenance Cost View Selection Problem --- p.39Chapter 3.4 --- Summary --- p.41Chapter 4 --- What Difference Heuristics Make --- p.43Chapter 4.1 --- Motivation --- p.44Chapter 4.2 --- Example --- p.46Chapter 4.3 --- Existing Algorithms --- p.49Chapter 4.3.1 --- A*-Heuristic --- p.51Chapter 4.3.2 --- Inverted-Tree Greedy --- p.52Chapter 4.3.3 --- Two-Phase Greedy --- p.54Chapter 4.3.4 --- Integrated Greedy --- p.57Chapter 4.4 --- A Performance Study --- p.60Chapter 4.5 --- Summary --- p.68Chapter 5 --- Materialized View Selection as Constrained Evolutionary Opti- mization --- p.71Chapter 5.1 --- Motivation --- p.72Chapter 5.2 --- Evolutionary Algorithms --- p.73Chapter 5.2.1 --- Constraint Handling: Penalty v.s. Stochastic Ranking --- p.74Chapter 5.2.2 --- The New Stochastic Ranking Evolutionary Algorithm --- p.78Chapter 5.3 --- Experimental Studies --- p.81Chapter 5.3.1 --- Experimental Setup --- p.82Chapter 5.3.2 --- Experimental Results --- p.82Chapter 5.4 --- Summary --- p.89Chapter 6 --- Dynamic Materialized View Management Based On Predicates --- p.90Chapter 6.1 --- Motivation --- p.91Chapter 6.2 --- Examples --- p.93Chapter 6.3 --- Related Work: Static Prepartitioning-Based Materialized View Management --- p.96Chapter 6.4 --- A New Dynamic Predicate-based Partitioning Approach --- p.99Chapter 6.4.1 --- System Overview --- p.102Chapter 6.4.2 --- Partition Advisor --- p.103Chapter 6.4.3 --- View Manager --- p.104Chapter 6.5 --- A Performance Study --- p.108Chapter 6.5.1 --- Performance Metrics --- p.110Chapter 6.5.2 --- Feasibility Studies --- p.110Chapter 6.5.3 --- Query Locality --- p.112Chapter 6.5.4 --- The Effectiveness of Disk Size --- p.115Chapter 6.5.5 --- Scalability --- p.115Chapter 6.6 --- Summary --- p.116Chapter 7 --- Conclusions and Future Work --- p.118Bibliography --- p.12

    Research on Materialized View Selection

    Get PDF
    定义了数据仓库领域的视图选择问题,并讨论了与该问题相关的代价模型、收益函数、代价计算、约束条件和视图索引等内容;介绍了3大类视图选择方法,即静态方法、动态方法和混合方法,以及各类方法的代表性研究成果;最后展望未来的研究方向.Definition of view selection issue in the field of data warehouses is presented, followed by the discussion of related problems, such as cost model, benefit function, cost computation, restriction condition, view index, etc. Then three categories of view selection methods, namely, static, dynamic and hybrid methods are discussed. For each method, some representative work is introduced. Finally some future trends in this area are discussed.Supported by the National Natural Science Foundation of China under Grant No.60473051 (国家自然科学基金); the National High-Tech Research and Development Plan of China under Grant Nos.2007AA01Z191, 2006AA01Z230 (国家高技术研究发展计划(863)

    An architecture for recycling intermediates in a column-store

    Full text link
    textabstractAutomatically recycling (intermediate) results is a grand challenge for state-of-the-art databases to improve both query response time and throughput. Tuples are loaded and streamed through a tuple-at-a-time processing pipeline avoiding materialization of intermediates as much as possible. This limits the opportunities for reuse of overlapping computations to DBA-defined materialized views and function/result cache tuning. In contrast, the operator-at-a-time execution paradigm produces fully materialized results in each step of the query plan. To avoid resource contention, these intermediates are evicted as soon as possible. In this paper we study an architecture that harvests the by-products of the operator-at-a-time paradigm in a column store system using a lightweight mechanism, the recycler. The key challenge then becomes selection of the policies to admit intermediates to the resource pool, their retention period, and the eviction strategy when facing resource limitations. The proposed recycling architecture has been implemented in an open-source system. An experimental analysis against the TPC-H ad-hoc decision support benchmark and a complex, real-world application (SkyServer) demonstrates its effectiveness in terms of self-organizing behavior and its significant performance gains. The results indicate the potentials of recycling intermediates and charters a route for further development of database kernels

    Open City Data Pipeline

    Get PDF
    Statistical data about cities, regions and at country level is collected for various purposes and from various institutions. Yet, while access to high quality and recent such data is crucial both for decision makers as well as for the public, all to often such collections of data remain isolated and not re-usable, let alone properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and republish this data in a reusable manner as Linked Data. The main feature of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular and extensible, always up-to-date fashion; (ii) we use both Machine Learning techniques as well as ontological reasoning over equational background knowledge to enrich the data by imputing missing values, (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data available both in a we browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV, and linking to e.g. DBpedia. Lastly, in an exhaustive evaluation of our approach, we compare our enrichment and cleansing techniques to a preliminary version of the Open City Data Pipeline presented at ISWC2015: firstly, we demonstrate that the combination of equational knowledge and standard machine learning techniques significantly helps to improve the quality of our missing value imputations; secondly, we arguable show that the more data we integrate, the more reliable our predictions become. Hence, over time, the Open City Data Pipeline shall provide a sustainable effort to serve Linked Data about cities in increasing quality.Series: Working Papers on Information Systems, Information Business and Operation
    corecore