8 research outputs found
Histograms and Wavelets on Probabilistic Data
There is a growing realization that uncertain information is a first-class
citizen in modern database management. As such, we need techniques to correctly
and efficiently process uncertain data in database systems. In particular, data
reduction techniques that can produce concise, accurate synopses of large
probabilistic relations are crucial. Similar to their deterministic relation
counterparts, such compact probabilistic data synopses can form the foundation
for human understanding and interactive data exploration, probabilistic query
planning and optimization, and fast approximate query processing in
probabilistic database systems.
In this paper, we introduce definitions and algorithms for building
histogram- and Haar-wavelet-based synopses on probabilistic data. The core
problem is to choose a set of histogram bucket boundaries or wavelet
coefficients to optimize the accuracy of the approximate representation of
a collection of probabilistic tuples under a given error metric. For a
variety of error metrics, we devise efficient algorithms that construct
optimal or near-optimal B-term histogram and wavelet synopses. This
requires careful analysis of the structure of the probability
distributions, and novel extensions of known dynamic-programming-based
techniques for the deterministic domain. Our experiments show that this
approach clearly outperforms simple ideas, such as building summaries for
samples drawn from the data distribution, while taking equal or less time.
Published in: IEEE Transactions on Knowledge and Data Engineering
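For intuition, here is a minimal Python sketch of the classic dynamic program for an optimal B-bucket (V-optimal) histogram over a deterministic frequency vector, minimizing the sum of squared errors; the paper's algorithms extend dynamic programs of this shape to probabilistic tuples and further error metrics. The function name and setup are illustrative, not the authors' code.

    def v_optimal_histogram(freqs, B):
        """Minimum sum-of-squared-errors of any B-bucket histogram over
        freqs, plus the chosen interior bucket boundaries."""
        n = len(freqs)
        # Prefix sums of values and squared values give the SSE of any
        # candidate bucket freqs[i:j] in O(1) time.
        p = [0.0] * (n + 1)   # p[i]  = sum of freqs[:i]
        pp = [0.0] * (n + 1)  # pp[i] = sum of squares of freqs[:i]
        for i, v in enumerate(freqs):
            p[i + 1] = p[i] + v
            pp[i + 1] = pp[i] + v * v

        def sse(i, j):  # error of a single bucket covering freqs[i:j]
            s, ss, m = p[j] - p[i], pp[j] - pp[i], j - i
            return ss - s * s / m  # equals sum((v - mean)^2)

        INF = float("inf")
        # dp[b][i] = best error for the prefix freqs[:i] using b buckets
        dp = [[INF] * (n + 1) for _ in range(B + 1)]
        cut = [[0] * (n + 1) for _ in range(B + 1)]
        dp[0][0] = 0.0
        for b in range(1, B + 1):
            for i in range(b, n + 1):
                for j in range(b - 1, i):  # last bucket is freqs[j:i]
                    cand = dp[b - 1][j] + sse(j, i)
                    if cand < dp[b][i]:
                        dp[b][i], cut[b][i] = cand, j
        # Recover the interior boundaries by walking the cut points back.
        bounds, i = [], n
        for b in range(B, 0, -1):
            i = cut[b][i]
            if i:
                bounds.append(i)
        return dp[B][n], sorted(bounds)

The run time is O(B n^2); for example, v_optimal_histogram([2, 2, 9, 10, 3], 2) places the single boundary after the first two values, grouping the three larger ones into one bucket.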
Improved probabilistic distance based locality preserving projections method to reduce dimensionality in large datasets
In this paper, dimensionality reduction in large datasets is achieved using the proposed distance-based Non-negative Matrix Factorization (NMF) technique, which is intended to solve the data dimensionality problem. Here, NMF and distance measurement aim to resolve the non-orthogonality problem caused by increased dataset dimensionality. The method initially partitions the datasets, organizes them into a defined geometric structure, and captures the dataset structure through a distance-based similarity measurement. The proposed method is designed to fit dynamic datasets, and it incorporates the intrinsic structure of the data using data geometry. The complexity of the data is further reduced using an Improved Distance-based Locality Preserving Projection. The proposed method is evaluated against existing methods in terms of accuracy, average accuracy, mutual information, and average mutual information.
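Since the abstract does not spell out the distance-based NMF variant, the following Python sketch shows only the generic building block such methods start from: non-negative matrix factorization with the classic Lee-Seung multiplicative updates for squared Euclidean distance, where the factor W acts as the reduced-dimensionality representation. All names and parameters here are illustrative assumptions.

    import numpy as np

    def nmf(X, rank, iters=200, eps=1e-9, seed=0):
        """Factor a non-negative matrix X (samples x features) as W @ H
        with W, H >= 0; rows of W are the reduced representations."""
        rng = np.random.default_rng(seed)
        n, m = X.shape
        W = rng.random((n, rank)) + eps
        H = rng.random((rank, m)) + eps
        for _ in range(iters):
            # Multiplicative updates keep the factors non-negative and
            # monotonically decrease ||X - W @ H||_F^2.
            H *= (W.T @ X) / (W.T @ W @ H + eps)
            W *= (X @ H.T) / (W @ H @ H.T + eps)
        return W, H

    # Usage: reduce 100 non-negative 50-dimensional points to rank 5.
    X = np.abs(np.random.default_rng(1).normal(size=(100, 50)))
    W, H = nmf(X, rank=5)
    print(W.shape, round(float(np.linalg.norm(X - W @ H)), 3))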
Doctor of Philosophy dissertation
In the era of big data, many applications generate continuous online data from distributed locations, scattered devices, and other sources; examples include data from social media, financial services, and sensor networks. Meanwhile, large volumes of data can be archived or stored offline in distributed locations for further analysis. Challenges from data uncertainty, large-scale data size, and distributed data sources motivate us to revisit several classic problems in both online and offline data exploration.

The problem of continuous threshold monitoring for distributed data is commonly encountered in many real-world applications. We study this problem for distributed probabilistic data. We show how to prune expensive threshold queries using various tail bounds, and we combine tail-bound techniques with adaptive algorithms for monitoring distributed deterministic data. We also show how to approximate threshold queries using sampling techniques.

Threshold monitoring can only tell whether a monitoring function is above or below a threshold constraint, not how far from it the function lies. This motivates us to study the problem of continuously tracking functions over distributed data. We first investigate the tracking problem on a chain topology, and then show how to solve tracking problems in a distributed setting using solutions for the chain model. We also study online tracking of the max function on "broom" trees and general tree topologies.

Finally, we examine building scalable histograms for distributed probabilistic data. We show how to build approximate histograms based on a partition-and-merge principle on a centralized machine, and then extend our solutions to distributed and parallel settings to further mitigate scalability bottlenecks and deal with distributed data.
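As a minimal sketch of the tail-bound pruning idea for probabilistic threshold queries, the Python below uses Cantelli's one-sided Chebyshev inequality to decide a query of the form Pr[sum > tau] >= delta from per-site means and variances alone, deferring sampling or exact evaluation to the unresolved case. The setting (independent sites with known moments) and all names are assumptions for illustration, not the dissertation's exact algorithms.

    def prune_threshold_query(means, variances, tau, delta):
        """Answer whether Pr[total > tau] >= delta when a tail bound
        settles it; otherwise report 'unresolved'."""
        mu = sum(means)
        var = sum(variances)  # independent sites: variances add up
        if tau > mu:
            # Cantelli: Pr[X - mu >= t] <= var / (var + t^2) for t > 0,
            # so a small upper bound proves the query is 'below'.
            t = tau - mu
            if var / (var + t * t) < delta:
                return "below"
        elif tau < mu:
            # The symmetric bound on the lower tail gives
            # Pr[X > tau] >= 1 - var / (var + (mu - tau)^2).
            t = mu - tau
            if 1.0 - var / (var + t * t) >= delta:
                return "above"
        return "unresolved"  # bound too loose; fall back to sampling

    # Usage: three sites, each holding an uncertain count with mean 10
    # and variance 4; is the total above 50 with probability >= 0.1?
    print(prune_threshold_query([10, 10, 10], [4, 4, 4], tau=50, delta=0.1))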