A Probabilistic Model for Data Cube Compression and Query Approximation

By Rokia Missaoui, Cyril Goutte, Anicet Kouomou Choupo and Ameur Boujenoui

Abstract

Databases and data warehouses contain an overwhelming volume of information that users must wade through to extract valuable, actionable knowledge in support of decision making. This contribution addresses the problem of automatically analyzing large multidimensional tables to obtain a concise representation of the data, identify patterns, and provide approximate answers to queries. Because data cubes are essentially multi-way tables, we analyze the potential of a probabilistic modeling technique, non-negative multi-way array factorization, for approximating aggregate and multidimensional values. With this technique, we compute the set of components (clusters) that best fit the initial data set and whose superposition approximates the original data. The generated components can then be used to answer OLAP queries approximately, including roll-up, slice, and dice operations. The proposed technique is then compared with log-linear modeling, which has already been used in the literature for compression and outlier detection in data cubes. Finally, three data sets are used to discuss the potential benefits of non-negative multi-way array factorization.
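
The abstract does not spell out the model, so the following is only a minimal sketch under stated assumptions: a three-dimensional cube of non-negative counts is approximated by a mixture of components, P(i,j,k) = sum over c of P(c) P(i|c) P(j|c) P(k|c), fitted with a plain EM loop. The function names (`fit_mixture`, `approx_cell`, `approx_rollup`), the random initialization, and the fixed iteration count are illustrative choices, not the authors' implementation. It shows how a cell value and a roll-up over one dimension can be read off the fitted components without touching the original cube.

```python
import random


def fit_mixture(cube, n_components, n_iter=100, seed=0):
    """Fit P(i,j,k) = sum_c P(c) P(i|c) P(j|c) P(k|c) to a 3-way cube by EM.

    Assumes `cube` is a nested list of non-negative numbers with at least
    one strictly positive cell.
    """
    rng = random.Random(seed)
    dims = (len(cube), len(cube[0]), len(cube[0][0]))
    # Only non-zero cells contribute to the likelihood.
    cells = [(i, j, k, cube[i][j][k])
             for i in range(dims[0])
             for j in range(dims[1])
             for k in range(dims[2])
             if cube[i][j][k] > 0]
    total = float(sum(x for _, _, _, x in cells))

    def random_dist(n):
        w = [rng.random() + 0.1 for _ in range(n)]
        s = sum(w)
        return [v / s for v in w]

    weights = random_dist(n_components)  # P(c)
    # factors[d][c] is the distribution over dimension d given component c.
    factors = [[random_dist(n) for _ in range(n_components)] for n in dims]

    for _ in range(n_iter):
        comp_mass = [0.0] * n_components
        dim_mass = [[[0.0] * n for _ in range(n_components)] for n in dims]
        # E-step: split each cell count among components by responsibility.
        for i, j, k, x in cells:
            joint = [weights[c] * factors[0][c][i]
                     * factors[1][c][j] * factors[2][c][k]
                     for c in range(n_components)]
            z = sum(joint)
            if z == 0.0:  # guard against numerical underflow
                continue
            for c in range(n_components):
                r = x * joint[c] / z
                comp_mass[c] += r
                dim_mass[0][c][i] += r
                dim_mass[1][c][j] += r
                dim_mass[2][c][k] += r
        # M-step: renormalize the accumulated masses.
        for c in range(n_components):
            weights[c] = comp_mass[c] / total
            if comp_mass[c] > 0.0:
                for d in range(3):
                    factors[d][c] = [m / comp_mass[c] for m in dim_mass[d][c]]
    return {"total": total, "weights": weights, "factors": factors}


def approx_cell(model, i, j, k):
    """Approximate a single cell as the superposition of all components."""
    f = model["factors"]
    return model["total"] * sum(
        w * f[0][c][i] * f[1][c][j] * f[2][c][k]
        for c, w in enumerate(model["weights"]))


def approx_rollup(model, i, j):
    """Approximate the roll-up of cell (i, j) over the third dimension.

    Each P(k|c) sums to one, so the third factor simply drops out.
    """
    f = model["factors"]
    return model["total"] * sum(
        w * f[0][c][i] * f[1][c][j]
        for c, w in enumerate(model["weights"]))


if __name__ == "__main__":
    # Hypothetical 2 x 2 x 2 cube of counts, for illustration only.
    demo = [[[5, 0], [1, 2]], [[0, 7], [3, 1]]]
    model = fit_mixture(demo, n_components=2)
    print(approx_cell(model, 0, 0, 0), approx_rollup(model, 0, 0))
```

In this sketch the compressed representation is just the component weights plus one small distribution per dimension and component, and a slice or dice would amount to evaluating `approx_cell` over the selected index ranges. The number of components is left as a free parameter here; how it should be chosen is not stated in the abstract.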

Topics: Statistical Models, Artificial Intelligence
Publisher: Sheridan Printing
Year: 2007
OAI identifier: oai:cogprints.org:5702
