616 research outputs found
CubiST++: Evaluating Ad-Hoc CUBE Queries Using Statistics Trees
We report on a new, efficient encoding for the data cube, which results in a drastic speed-up of OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes. We are focusing on a class of queries called cube queries, which return aggregated values rather than sets of tuples. Our approach, termed CubiST++ (Cubing with Statistics Trees Plus Families), represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the view lattice to compute and materialize new views from existing views in some heuristic fashion. Instead, CubiST++ encodes all possible aggregate views in the leaves of a new data structure called statistics tree (ST) during a one-time scan of the detailed data. In order to optimize the queries involving constraints on hierarchy levels of the underlying dimensions, we select and materialize a family of candidate trees, which represent superviews over the different hierarchical levels of the dimensions. Given a query, our query evaluation algorithm selects the smallest tree in the family, which can provide the answer. Extensive evaluations of our prototype implementation have demonstrated its superior run-time performance and scalability when compared with existing MOLAP and ROLAP systems
On-line analytical processing in distributed data warehouses
The concepts of 'data warehousing' and 'on-line analytical processing' have seen a growing interest in the research and commercial product community. Today, the trend moves away from complex centralized data warehouses to distributed data marts integrated in a common conceptual schema. However, as the first part of this paper demonstrates, there are many problems and little solutions for large distributed decision support systems in worldwide operating corporations. After showing the benefits and problems of the distributed approach, this paper outlines possibilities for achieving performance in distributed online analytical processing. Finally, the architectural framework of the prototypical distributed OLAP system CUBESTAR is outlined
EFFICIENT APPROACH FOR VIEW SELECTION FOR DATA WAREHOUSE USING TREE MINING AND EVOLUTIONARY COMPUTATION
Selection of a proper set of views to materialize plays an important role indatabase performance. There are many methods of view selection which uses different techniques and frameworks to select an efficient set of views for materialization. In this paper, we present a new efficient, scalable method for view selection under the given storage constraints using a tree mining approach and evolutionary optimization. Tree mining algorithm is designed to determine the exact frequency of (sub)queries in the historical SQL dataset. Query Cost model achieves the objective of maximizing the performance benefits from the final view set which is derived from the frequent view set given by tree mining algorithm. Performance benefit of a query is defined as a function of queryfrequency, query creation cost, and query maintenance cost. The experimental results shows that the proposed method is successful in recommending a solution which is fairly close to optimal solution
The use of alternative data models in data warehousing environments
Data Warehouses are increasing their data volume at an accelerated rate; high disk
space consumption; slow query response time and complex database administration are
common problems in these environments. The lack of a proper data model and an
adequate architecture specifically targeted towards these environments are the root
causes of these problems.
Inefficient management of stored data includes duplicate values at column level and
poor management of data sparsity which derives from a low data density, and affects
the final size of Data Warehouses. It has been demonstrated that the Relational Model
and Relational technology are not the best techniques for managing duplicates and data
sparsity.
The novelty of this research is to compare some data models considering their data
density and their data sparsity management to optimise Data Warehouse environments.
The Binary-Relational, the Associative/Triple Store and the Transrelational models
have been investigated and based on the research results a novel Alternative Data
Warehouse Reference architectural configuration has been defined.
For the Transrelational model, no database implementation existed. Therefore it was
necessary to develop an instantiation of it’s storage mechanism, and as far as could be
determined this is the first public domain instantiation available of the storage
mechanism for the Transrelational model
Treatment of imprecision in data repositories with the aid of KNOLAP
Traditional data repositories introduced for the needs of business
processing, typically focus on the storage and querying of crisp
domains of data. As a result, current commercial data repositories
have no facilities for either storing or querying imprecise/
approximate data.
No significant attempt has been made for a generic and applicationindependent
representation of value imprecision mainly as a
property of axes of analysis and also as part of dynamic
environment, where potential users may wish to define their “own”
axes of analysis for querying either precise or imprecise facts. In
such cases, measured values and facts are characterised by
descriptive values drawn from a number of dimensions, whereas
values of a dimension are organised as hierarchical levels.
A solution named H-IFS is presented that allows the representation
of flexible hierarchies as part of the dimension structures. An
extended multidimensional model named IF-Cube is put forward,
which allows the representation of imprecision in facts and
dimensions and answering of queries based on imprecise
hierarchical preferences. Based on the H-IFS and IF-Cube
concepts, a post relational OLAP environment is delivered, the
implementation of which is DBMS independent and its performance
solely dependent on the underlying DBMS engine
Flexible Integration and Efficient Analysis of Multidimensional Datasets from the Web
If numeric data from the Web are brought together, natural scientists can compare climate measurements with estimations, financial analysts can evaluate companies based on balance sheets and daily stock market values, and citizens can explore the GDP per capita from several data sources. However, heterogeneities and size of data remain a problem. This work presents methods to query a uniform view - the Global Cube - of available datasets from the Web and builds on Linked Data query approaches
Efficient Evaluation of Sparse Data Cubes
Computing data cubes requires the aggregation of measures over arbitrary combinations of dimensions in a data set. Efficient data cube evaluation remains challenging because of the potentially very large sizes of input datasets (e.g., in the data warehousing context), the well-known curse of dimensionality, and the complexity of queries that need to be supported. This paper proposes a new dynamic data structure called SST (Sparse Statistics Trees) and a novel, in-teractive, and fast cube evaluation algorithm called CUPS (Cubing by Pruning SST), which is especially well suitable for computing aggregates in cubes whose data sets are sparse. SST only stores the aggregations of non-empty cube cells instead of the detailed records. Furthermore, it retains in memory the dense cubes (a.k.a. iceberg cubes) whose aggregate values are above a threshold. Sparse cubes are stored on disks. This allows a fast, accurate approximation for queries. If users desire more refined answers, related sparse cubes are aggregated. SST is incrementally maintainable, which makes CUPS suitable for data warehousing and analysis of streaming data. Experiment results demonstrate the excellent performance and good scalability of our approach
Efficient Incremental View Maintenance for Data Warehousing
Data warehousing and on-line analytical processing (OLAP) are essential elements for decision support applications. Since most OLAP queries are complex and are often executed over huge volumes of data, the solution in practice is to employ materialized views to improve query performance. One important issue for utilizing materialized views is to maintain the view consistency upon source changes. However, most prior work focused on simple SQL views with distributive aggregate functions, such as SUM and COUNT. This dissertation proposes to consider broader types of views than previous work. First, we study views with complex aggregate functions such as variance and regression. Such statistical functions are of great importance in practice. We propose a workarea function model and design a generic framework to tackle incremental view maintenance and answering queries using views for such functions. We have implemented this approach in a prototype system of IBM DB2. An extensive performance study shows significant performance gains by our techniques. Second, we consider materialized views with PIVOT and UNPIVOT operators. Such operators are widely used for OLAP applications and for querying sparse datasets. We demonstrate that the efficient maintenance of views with PIVOT and UNPIVOT operators requires more generalized operators, called GPIVOT and GUNPIVOT. We formally define and prove the query rewriting rules and propagation rules for such operators. We also design a novel view maintenance framework for applying these rules to obtain an efficient maintenance plan. Extensive performance evaluations reveal the effectiveness of our techniques. Third, materialized views are often integrated from multiple data sources. Due to source autonomicity and dynamicity, concurrency may occur during view maintenance. We propose a generic concurrency control framework to solve such maintenance anomalies. This solution extends previous work in that it solves the anomalies under both source data and schema changes and thus achieves full source autonomicity. We have implemented this technique in a data warehouse prototype developed at WPI. The extensive performance study shows that our techniques put little extra overhead on existing concurrent data update processing techniques while allowing for this new functionality
- …