
Data Cube Approximation and Mining using Probabilistic Modeling

By Cyril Goutte, Rokia Missaoui and Ameur Boujenoui

Abstract

On-line Analytical Processing (OLAP) techniques, commonly used in data warehouses, allow the exploration of data cubes along different analysis axes (dimensions) and at different abstraction levels in a dimension hierarchy. However, such techniques are not aimed at mining multidimensional data. Since data cubes are essentially multi-way tables, we propose to analyze the potential of two probabilistic modeling techniques, namely non-negative multi-way array factorization and log-linear modeling, with the ultimate objective of compressing and mining aggregate and multidimensional values. With the first technique, we compute the set of components that best fit the initial data set and whose superposition coincides with the original data; with the second technique, we identify a parsimonious model (i.e., one with a reduced set of parameters), highlight strong associations among dimensions, and discover possible outliers in data cells. A real-life example is used to (i) discuss the potential benefits of the modeling output for cube exploration and mining, (ii) show how OLAP queries can be answered in an approximate way, and (iii) illustrate the strengths and limitations of these modeling approaches.
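
The idea of approximating a cube by a superposition of non-negative components, and then answering an aggregate query from the components alone, can be illustrated with a small sketch. The abstract does not say how the components are fitted, so the code below is an assumption rather than the authors' method: it fits a mixture of rank-one non-negative components to a 3-way count array with EM (a PLSA-style procedure). The function names `fit_mixture`, `approx_cell` and `approx_rollup`, the number of components, and the toy counts are all hypothetical.

```python
import random

def fit_mixture(cube, n_components, n_iter=200, seed=0):
    """EM fit of cube[i][j][k] ~ N * sum_c P(c) P(i|c) P(j|c) P(k|c).

    Assumed illustrative procedure, not taken from the paper.
    """
    rng = random.Random(seed)
    I, J, K = len(cube), len(cube[0]), len(cube[0][0])
    total = float(sum(x for plane in cube for row in plane for x in row))

    def random_dist(n):
        # Strictly positive random probability vector of length n.
        w = [rng.random() + 0.1 for _ in range(n)]
        s = sum(w)
        return [v / s for v in w]

    C = range(n_components)
    pc = random_dist(n_components)
    pa = [random_dist(I) for _ in C]
    pb = [random_dist(J) for _ in C]
    pd = [random_dist(K) for _ in C]

    for _ in range(n_iter):
        # E-step: split each cell count among components (expected counts).
        nc = [0.0 for _ in C]
        na = [[0.0] * I for _ in C]
        nb = [[0.0] * J for _ in C]
        nd = [[0.0] * K for _ in C]
        for i in range(I):
            for j in range(J):
                for k in range(K):
                    x = cube[i][j][k]
                    joint = [pc[c] * pa[c][i] * pb[c][j] * pd[c][k] for c in C]
                    z = sum(joint)
                    if x == 0 or z == 0.0:
                        continue
                    for c in C:
                        r = x * joint[c] / z
                        nc[c] += r
                        na[c][i] += r
                        nb[c][j] += r
                        nd[c][k] += r
        # M-step: renormalize expected counts into probabilities.
        for c in C:
            pc[c] = nc[c] / total
            if nc[c] > 0.0:
                pa[c] = [v / nc[c] for v in na[c]]
                pb[c] = [v / nc[c] for v in nb[c]]
                pd[c] = [v / nc[c] for v in nd[c]]
    return total, pc, pa, pb, pd

def approx_cell(model, i, j, k):
    """Approximate value of one cell from the fitted components."""
    total, pc, pa, pb, pd = model
    return total * sum(pc[c] * pa[c][i] * pb[c][j] * pd[c][k]
                       for c in range(len(pc)))

def approx_rollup(model, i, j):
    """Approximate SUM over the third dimension for (i, j).

    Each P(k|c) sums to 1 over k, so the third factor drops out and the
    query is answered without visiting any cell of the cube.
    """
    total, pc, pa, pb, pd = model
    return total * sum(pc[c] * pa[c][i] * pb[c][j] for c in range(len(pc)))

# Toy 2 x 3 x 2 cube of counts (made-up numbers, for illustration only).
cube = [
    [[20, 5], [18, 4], [2, 9]],
    [[3, 12], [2, 10], [1, 30]],
]
model = fit_mixture(cube, n_components=2)
print("exact roll-up :", sum(cube[0][0]))
print("approx roll-up:", approx_rollup(model, 0, 0))
```

With few components the model stores far fewer parameters than the cube has cells, which is the compression aspect; the price is that cell values and roll-ups are only approximate. In this particular sketch the one-way marginal totals are reproduced exactly after each M-step, while finer aggregates generally are not.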

Topics: Machine Learning, Artificial Intelligence
Year: 2007
OAI identifier: oai:cogprints.org:5622
Full text:
  • http://cogprints.org/5622/1/go...
  • http://cogprints.org/5622/
