
65 research outputs found

Histograms and Wavelets on Probabilistic Data

Author: Cormode Graham
Garofalakis Minos
Publication date: 01/01/2008

There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial. Like their counterparts for deterministic relations, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of different error metrics, we devise efficient algorithms that construct optimal or near-optimal B-term histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data distribution, while taking equal or less time.
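
Illustration (not from the paper): the abstract refers to known dynamic-programming techniques for the deterministic domain. A minimal sketch of one such technique, assuming the classic sum-of-squared-errors histogram on plain (non-probabilistic) values, might look as follows. The function name `v_optimal_histogram` and its return shape are hypothetical, and the paper's probabilistic extensions are not reproduced here.

```python
def v_optimal_histogram(values, num_buckets):
    """Split `values` into contiguous buckets minimizing total squared error.

    Each bucket is represented by its mean. Returns (total_sse, buckets),
    where buckets is a list of half-open index ranges (start, end).
    Deterministic data only; this is an illustrative assumption.
    """
    n = len(values)
    if n == 0 or num_buckets < 1:
        raise ValueError("need at least one value and one bucket")
    b = min(num_buckets, n)

    # Prefix sums let us compute the SSE of any range in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] + v
        prefix_sq[i + 1] = prefix_sq[i] + v * v

    def sse(i, j):
        # Squared error of values[i:j] around its own mean.
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[k][j]: best SSE covering values[:j] with exactly k buckets.
    cost = [[inf] * (n + 1) for _ in range(b + 1)]
    cut = [[0] * (n + 1) for _ in range(b + 1)]
    cost[0][0] = 0.0
    for k in range(1, b + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = cost[k - 1][i] + sse(i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    cut[k][j] = i

    # Walk the stored cut points backwards to recover the buckets.
    buckets = []
    j = n
    for k in range(b, 0, -1):
        i = cut[k][j]
        buckets.append((i, j))
        j = i
    buckets.reverse()
    return cost[b][n], buckets
```

This sketch runs in O(B n^2) time, which is the textbook cost of the deterministic dynamic program; it is meant only to make the notion of choosing bucket boundaries under an error metric concrete.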

arXiv.org e-Print Archive

CiteSeerX

Crossref

Warwick Research Archives Portal Repository

Approximation algorithms for wavelet transform coding of data streams

Author: Guha Sudipto
Harb Boulos
Publication date: 01/01/2006

This paper addresses the problem of finding a B-term wavelet representation of a given discrete function $f \in \mathbb{R}^n$ whose distance from $f$ is minimized. The problem is well understood when we seek to minimize the Euclidean distance between $f$ and its representation. The first known algorithms for finding provably approximate representations minimizing general $\ell_p$ distances (including $\ell_\infty$) under a wide variety of compactly supported wavelet bases are presented in this paper. For the Haar basis, a polynomial time approximation scheme is demonstrated. These algorithms are applicable in the one-pass sublinear-space data stream model of computation. They generalize naturally to multiple dimensions and weighted norms. A universal representation that provides a provable approximation guarantee under all p-norms simultaneously, and the first approximation algorithms for bit-budget versions of the problem, known as adaptive quantization, are also presented. Further, it is shown that the algorithms presented here can be used to select a basis from a tree-structured dictionary of bases and find a B-term representation of the given function that provably approximates its best dictionary-basis representation.
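
Illustration (not from the paper): the abstract notes that the Euclidean case is well understood. For an orthonormal basis such as the normalized Haar basis, keeping the B coefficients of largest magnitude minimizes the Euclidean error, because the squared error equals the sum of squares of the dropped coefficients. The sketch below shows only that baseline, assuming an input length that is a power of two; the function names are hypothetical, and the paper's algorithms for general $\ell_p$ distances are not reproduced.

```python
import math


def haar_forward(signal):
    """Orthonormal Haar transform of a list whose length is a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    coeffs = [float(v) for v in signal]
    root2 = math.sqrt(2.0)
    length = n
    while length > 1:
        half = length // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / root2 for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / root2 for i in range(half)]
        # Layout: scaled averages first, then this level's detail coefficients.
        coeffs[:length] = averages + details
        length = half
    return coeffs


def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = [float(c) for c in coeffs]
    root2 = math.sqrt(2.0)
    length = 1
    while length < len(out):
        merged = []
        for a, d in zip(out[:length], out[length:2 * length]):
            merged.append((a + d) / root2)
            merged.append((a - d) / root2)
        out[:2 * length] = merged
        length *= 2
    return out


def best_b_term_l2(signal, b):
    """Keep the b largest-magnitude Haar coefficients and reconstruct."""
    coeffs = haar_forward(signal)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:b])
    sparse = [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
    return sparse, haar_inverse(sparse)
```

For example, a step function such as `[4, 4, 4, 4, 0, 0, 0, 0]` has only two nonzero Haar coefficients, so `best_b_term_l2(signal, 2)` should reconstruct it up to floating-point error.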

arXiv.org e-Print Archive

CiteSeerX

ScholarlyCommons@Penn

Accurate Data Approximation in Constrained Environments

Author: Deligiannakis Antonios
Publication date: 15/06/2005

Several data reduction techniques have been proposed recently as methods for providing fast and fairly accurate answers to complex queries over large quantities of data. Their use has been widespread, due to the multiple benefits that they may offer in several constrained environments and applications. Compressed data representations require less space to store and less bandwidth to communicate, and can provide, due to their size, very fast response times to queries. Sensor networks represent a typical constrained environment, due to the limited processing, storage and battery capabilities of the sensor nodes. Large-scale sensor networks require tight data handling and data dissemination techniques. Transmitting a full-resolution data feed from each sensor back to the base station is often prohibitive due to (i) limited bandwidth that may not be sufficient to sustain a continuous feed from all sensors and (ii) increased power consumption due to the wireless multi-hop communication. In order to minimize the volume of the transmitted data, we can apply two well-known data reduction techniques: aggregation and approximation. In this dissertation we propose novel data reduction techniques for the transmission of measurements collected in sensor network environments. We first study the problem of summarizing multi-valued data feeds generated at a single sensor node, a step necessary for the transmission of large amounts of historical information collected at the node. The transmission of these measurements may either be periodic (i.e., when a certain amount of measurements has been collected), or in response to a query from the base station. We then also consider the approximate evaluation of aggregate continuous queries. A continuous query is a query that runs continuously until explicitly terminated by the user. These queries can be used to obtain a live estimate of some (aggregated) quantity, such as the total number of moving objects detected by the sensors.

Digital Repository at the University of Maryland

PolyFit: Polynomial-based Indexing Approach for Fast Approximate Range Aggregate Queries

Author: Chan Tsz Nam
Jensen Christian S.
Li Zhe
Yiu Man Lung
Publication date: 01/01/2021

Range aggregate queries find frequent application in data analytics. In some use cases, approximate results are preferred over accurate results if they can be computed rapidly and satisfy approximation guarantees. Inspired by a recent indexing approach, we provide means of representing a discrete point data set by continuous functions that can then serve as compact index structures. More specifically, we develop a polynomial-based indexing approach, called PolyFit, for processing approximate range aggregate queries. PolyFit is capable of supporting multiple types of range aggregate queries, including COUNT, SUM, MIN and MAX aggregates, with guaranteed absolute and relative error bounds. Experimental results show that PolyFit is faster and more accurate and compact than existing learned index structures.
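
Illustration (not from the paper): the general idea of answering a range SUM from a continuous function can be sketched by fitting a polynomial P to the cumulative sum C over the keys. If |P(k) - C(k)| <= delta at every key, then P(u) - P(l) is within 2 * delta of the exact sum over keys in (l, u]. The sketch below assumes a single least-squares polynomial and measures delta after fitting; this is a simplification chosen for brevity and should not be read as PolyFit's actual fitting or segmentation procedure. All function names are hypothetical.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients in ascending order of power. Assumes at least
    degree + 1 distinct x values, otherwise the system is singular.
    """
    m = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            rhs[r] -= factor * rhs[col]
    # Back substitution.
    coef = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, m))
        coef[i] = (rhs[i] - tail) / a[i][i]
    return coef


def evaluate(coef, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coef):
        result = result * x + c
    return result


def build_index(keys, measures, degree=2):
    """Fit the cumulative sum over sorted keys; return (coef, delta)."""
    cumulative = []
    running = 0.0
    for value in measures:
        running += value
        cumulative.append(running)
    coef = fit_polynomial(keys, cumulative, degree)
    # delta is the largest deviation observed at the data keys.
    delta = max(abs(evaluate(coef, k) - c) for k, c in zip(keys, cumulative))
    return coef, delta


def approx_range_sum(coef, delta, low_key, high_key):
    """Estimate the sum of measures with low_key < key <= high_key.

    Both endpoints are assumed to be existing keys, so the returned
    bound 2 * delta follows from the triangle inequality.
    """
    estimate = evaluate(coef, high_key) - evaluate(coef, low_key)
    return estimate, 2.0 * delta
```

The index here is just the coefficient list plus delta, which is what makes this kind of structure compact; a query costs two polynomial evaluations regardless of how many keys fall in the range.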

arXiv.org e-Print Archive

VBN

Query estimation techniques in database systems

Author: König Arnd Christian
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 23/09/2004

The effectiveness of query optimization in database systems critically depends on the system's ability to assess the execution costs of different query execution plans. For this purpose, the sizes and data distributions of the intermediate results generated during plan execution need to be estimated as accurately as possible. This estimation requires the maintenance of statistics on the data stored in the database, which are referred to as data synopses. While the problem of query cost estimation has received significant attention for over a decade, it has remained an open issue in practice, because most previous techniques have focused on singular aspects of the problem such as minimizing the estimation error of a single type of query and a single data distribution, whereas database management systems generally need to support a wide range of queries over a number of datasets. In this thesis I introduce a new technique for query result estimation, which extends existing techniques in that it offers estimation for all combinations of the three major database operators: selection, projection, and join. The approach is based on separate and independent approximations of the attribute values contained in a dataset and their frequencies. Through the use of space-filling curves, the approach extends to multi-dimensional data, while maintaining its accuracy and computational properties. The resulting estimation accuracy is competitive with specialized techniques and superior to the histogram techniques currently implemented in commercial database management systems. Because data synopses reside in main memory, they compete for available space with the database cache and query execution buffers. Consequently, the memory available to data synopses needs to be used efficiently. This results in a physical design problem for data synopses, which is to determine the best set of synopses for a given combination of datasets, queries, and available memory.
This thesis introduces a formalization of the problem, and efficient algorithmic solutions. All discussed techniques are evaluated with regard to their overhead and resulting estimation accuracy on a variety of synthetic and real-life datasets.
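
Illustration (not from the thesis): the abstract mentions space-filling curves as the way to carry a one-dimensional approximation over to multi-dimensional data. The abstract does not say which curve is used, so the sketch below assumes a two-dimensional Z-order (Morton) curve purely as an example of mapping grid cells to a single linear key. The function names and the `bits` parameter are hypothetical.

```python
def morton_encode(x, y, bits=16):
    """Interleave the bits of non-negative integers x and y into one key.

    Bit i of x goes to position 2*i and bit i of y to position 2*i + 1,
    so nearby cells tend to receive nearby keys.
    """
    if x < 0 or y < 0 or x >= 1 << bits or y >= 1 << bits:
        raise ValueError("coordinates must fit in the given number of bits")
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_decode(code, bits=16):
    """Inverse of morton_encode: recover (x, y) from the interleaved key."""
    x = 0
    y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y
```

Once every cell has a linear key, any one-dimensional synopsis can be built over the keys in sorted order; how well locality is preserved depends on the curve chosen.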

Universaar


Query estimation techniques in database systems

Author: König Arnd Christian
Publication venue: Fakultät 6 - Naturwissenschaftlich-Technische Fakultät I. Fachrichtung 6.2 - Informatik
Publication date: 01/01/2001

The effectiveness of query optimization in database systems critically depends on the system's ability to assess the execution costs of different query execution plans. For this purpose, the sizes and data distributions of the intermediate results generated during plan execution need to be estimated as accurately as possible. This estimation requires the maintenance of statistics on the data stored in the database, which are referred to as data synopses. While the problem of query cost estimation has received significant attention for over a decade, it has remained an open issue in practice, because most previous techniques have focused on singular aspects of the problem such as minimizing the estimation error of a single type of query and a single data distribution, whereas database management systems generally need to support a wide range of queries over a number of datasets. In this thesis I introduce a new technique for query result estimation, which extends existing techniques in that it offers estimation for all combinations of the three major database operators: selection, projection, and join. The approach is based on separate and independent approximations of the attribute values contained in a dataset and their frequencies. Through the use of space-filling curves, the approach extends to multi-dimensional data, while maintaining its accuracy and computational properties. The resulting estimation accuracy is competitive with specialized techniques and superior to the histogram techniques currently implemented in commercial database management systems. Because data synopses reside in main memory, they compete for available space with the database cache and query execution buffers. Consequently, the memory available to data synopses needs to be used efficiently. This results in a physical design problem for data synopses, which is to determine the best set of synopses for a given combination of datasets, queries, and available memory.
This thesis introduces a formalization of the problem, and efficient algorithmic solutions. All discussed techniques are evaluated with regard to their overhead and resulting estimation accuracy on a variety of synthetic and real-life datasets.

CiteSeerX