Efficient Evaluation of Sparse Data Cubes

Author: Fu Lixin
NC DOCKS at The University of North Carolina at Greensboro
Publication date: 01/01/2004

Computing data cubes requires the aggregation of measures over arbitrary combinations of dimensions in a data set. Efficient data cube evaluation remains challenging because of the potentially very large sizes of input datasets (e.g., in the data warehousing context), the well-known curse of dimensionality, and the complexity of queries that need to be supported. This paper proposes a new dynamic data structure called SST (Sparse Statistics Trees) and a novel, interactive, and fast cube evaluation algorithm called CUPS (Cubing by Pruning SST), which is especially well suited to computing aggregates in cubes whose data sets are sparse. SST stores only the aggregations of non-empty cube cells instead of the detailed records. Furthermore, it retains in memory the dense cubes (a.k.a. iceberg cubes) whose aggregate values are above a threshold. Sparse cubes are stored on disk. This allows a fast, accurate approximation for queries. If users desire more refined answers, related sparse cubes are aggregated. SST is incrementally maintainable, which makes CUPS suitable for data warehousing and analysis of streaming data. Experimental results demonstrate the excellent performance and good scalability of our approach.
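The idea of storing only non-empty cells and separating dense from sparse aggregates can be illustrated with a small sketch. This is not the SST structure or the CUPS algorithm from the paper: it uses a plain dictionary, and the names `build_sparse_cube`, `split_by_threshold`, `ALL`, and the sample records are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # wildcard marking a dimension that has been aggregated away


def build_sparse_cube(records, num_dims):
    """Sum the measure for every non-empty cell of every group-by.

    Each record is (dim_1, ..., dim_n, measure). Only cells that at least
    one record falls into are ever created, so empty cells cost nothing.
    """
    cube = defaultdict(float)
    for *dims, measure in records:
        # Every subset of "kept" dimensions yields one cell for this record.
        for k in range(num_dims + 1):
            for kept in combinations(range(num_dims), k):
                cell = tuple(dims[i] if i in kept else ALL
                             for i in range(num_dims))
                cube[cell] += measure
    return dict(cube)


def split_by_threshold(cube, threshold):
    """Separate cells whose aggregate is above the threshold from the rest."""
    dense = {c: v for c, v in cube.items() if v > threshold}
    sparse = {c: v for c, v in cube.items() if v <= threshold}
    return dense, sparse


# Hypothetical sample data: (state, year, category, sales)
records = [
    ("NC", "2004", "books", 10.0),
    ("NC", "2004", "music", 5.0),
    ("VA", "2003", "books", 2.0),
]
cube = build_sparse_cube(records, num_dims=3)
dense, sparse = split_by_threshold(cube, threshold=10.0)

print(cube[(ALL, ALL, ALL)])      # grand total over all records
print(("NC", ALL, ALL) in dense)  # NC total is above the threshold
```

In this toy version both dictionaries live in memory. In the setting the abstract describes, the `sparse` part would be the portion kept on disk and consulted only when a more refined answer is requested.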

Data Cube Approximation and Mining using Probabilistic Modeling

Author: Boujenoui Ameur
Goutte Cyril
Missaoui Rokia
Publication date: 01/01/2007

On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction levels in a dimension hierarchy. However, such techniques are not aimed at mining multidimensional data. Since data cubes are nothing but multi-way tables, we propose to analyze the potential of two probabilistic modeling techniques, namely non-negative multi-way array factorization and log-linear modeling, with the ultimate objective of compressing and mining aggregate and multidimensional values. With the first technique, we compute the set of components that best fit the initial data set and whose superposition coincides with the original data; with the second technique, we identify a parsimonious model (i.e., one with a reduced set of parameters), highlight strong associations among dimensions, and discover possible outliers in data cells. A real-life example will be used to (i) discuss the potential benefits of the modeling output on cube exploration and mining, (ii) show how OLAP queries can be answered in an approximate way, and (iii) illustrate the strengths and limitations of these modeling approaches.
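As a rough illustration of how a log-linear model can flag unusual cells, the sketch below fits the simplest such model, two-way independence, to a small contingency table and marks cells with large Pearson residuals. It is an assumption-laden toy rather than the authors' procedure: the function names, the sample counts, and the cutoff of 2 are all hypothetical, and every expected count is assumed to be positive.

```python
import math


def independence_fit(table):
    """Expected counts under independence: m_ij = row_i * col_j / total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_residuals(table, expected):
    """(observed - expected) / sqrt(expected) for each cell."""
    return [[(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(table, expected)]


# Hypothetical 3x3 slice of a cube (e.g. region x product counts).
table = [
    [20, 30, 50],
    [30, 45, 75],
    [10, 15, 95],
]
expected = independence_fit(table)
residuals = pearson_residuals(table, expected)

# Cells whose residual magnitude exceeds an illustrative cutoff of 2.
flagged = [(i, j)
           for i, row in enumerate(residuals)
           for j, r in enumerate(row)
           if abs(r) > 2.0]
print(flagged)
```

The fitted table needs only the row and column totals, which is the sense in which such a model is more compact than the full set of cell values; richer log-linear models add interaction terms between dimensions.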

Scalable Data Analysis on MapReduce-based Systems

Author: WANG ZHENGKUI
Publication date: 19/06/2013

Ph.D., Doctor of Philosophy

Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams

Author: Ben Basat Ran
Cheng Zhuo
Liu Zaoxing
Manousis Antonis
Sekar Vyas
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/07/2022

Today’s large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable costs. The root cause is the combinatorial explosion of data subpopulations and the diversity of summary statistics we need to monitor simultaneously. We present Hydra, an efficient framework for multidimensional analytics that combines a “sketch of sketches”, which avoids the overhead of monitoring exponentially many subpopulations, with universal sketching, which ensures accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra can achieve robust error bounds and is an order of magnitude more efficient in terms of operational cost and memory footprint than existing frameworks (e.g., Spark, Druid) while ensuring interactive estimation times.
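For readers unfamiliar with sketching, the snippet below shows a much simpler, well-known structure, a count-min sketch, applied to (subpopulation, item) keys. It is not Hydra's "sketch of sketches" or a universal sketch, and it estimates frequencies only. The class, its parameters, and the sample stream are hypothetical and serve only to show how a fixed-size summary can answer per-subpopulation queries without a counter per key.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One salted hash per row gives `depth` different hash functions.
        digest = hashlib.blake2b(
            repr(key).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.rows[r][self._index(r, key)]
                   for r in range(self.depth))


# Hypothetical stream of (subpopulation, item) observations.
stream = [
    (("video", "US"), "error"),
    (("video", "US"), "error"),
    (("video", "EU"), "error"),
    (("video", "US"), "ok"),
]
sketch = CountMinSketch()
for subpopulation, item in stream:
    sketch.add((subpopulation, item))

print(sketch.estimate((("video", "US"), "error")))
```

Memory use depends on `width * depth`, not on the number of distinct subpopulations, which is the property that makes sketches attractive when the subpopulations are too numerous to track individually.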

Java, Java, Java: Object-Oriented Problem Solving

Author: Morelli Ralph
Walde Ralph
Publication venue: Ralph Morelli, Ralph Walde
Publication date: 01/01/2016

Open Access Textbook from Open Textbook Library: Java, Java, Java, 3e was previously published by Pearson Education, Inc. The first edition (2000) and the second edition (2003) were published by Prentice-Hall. In 2010 Pearson Education, Inc. reassigned the copyright to the authors, and we are happy now to be able to make the book available under an open source license. This PDF edition of the book is available under a Creative Commons Attribution 4.0 International License, which allows the book to be used, modified, and shared with attribution: (https://creativecommons.org/licenses/by/4.0/). – Ralph Morelli and Ralph Walde – Hartford, CT – December 30, 201