Histograms and Wavelets on Probabilistic Data
There is a growing realization that uncertain information is a first-class
citizen in modern database management. As such, we need techniques to correctly
and efficiently process uncertain data in database systems. In particular, data
reduction techniques that can produce concise, accurate synopses of large
probabilistic relations are crucial. Similar to their deterministic relation
counterparts, such compact probabilistic data synopses can form the foundation
for human understanding and interactive data exploration, probabilistic query
planning and optimization, and fast approximate query processing in
probabilistic database systems.
In this paper, we introduce definitions and algorithms for building
histogram- and wavelet-based synopses on probabilistic data. The core problem
is to choose a set of histogram bucket boundaries or wavelet coefficients to
optimize the accuracy of the approximate representation of a collection of
probabilistic tuples under a given error metric. For a variety of different
error metrics, we devise efficient algorithms that construct optimal or
near-optimal B-term histogram and wavelet synopses. This requires careful analysis
of the structure of the probability distributions, and novel extensions of
known dynamic-programming-based techniques for the deterministic domain. Our
experiments show that this approach clearly outperforms simple ideas, such as
building summaries for samples drawn from the data distribution, while taking
equal or less time.
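The dynamic programs referred to above extend the classic V-optimal histogram construction for deterministic data. As a point of reference only (this is the deterministic baseline, not the authors' probabilistic algorithm), the sketch below shows the standard O(n^2 * B) dynamic program that picks B bucket boundaries minimizing the sum of squared errors; the function name v_optimal_histogram and the error metric are illustrative choices.

```python
# Classic V-optimal histogram DP for deterministic data (illustrative baseline;
# the paper extends this style of DP to probabilistic tuples and other error metrics).

def v_optimal_histogram(values, num_buckets):
    """Return the minimal sum-of-squared-error achievable with num_buckets buckets,
    plus the chosen bucket boundaries, via the standard O(n^2 * B) dynamic program."""
    n = len(values)
    # Prefix sums of values and squared values allow O(1) per-interval error queries.
    pre = [0.0] * (n + 1)
    pre2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        pre[i + 1] = pre[i] + v
        pre2[i + 1] = pre2[i] + v * v

    def sse(i, j):
        # Sum of squared deviations from the mean over values[i:j] (one bucket's error).
        s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    # dp[k][j] = best error for the first j values using k buckets.
    dp = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    dp[0][0] = 0.0
    for k in range(1, num_buckets + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):          # last bucket covers values[i:j]
                cand = dp[k - 1][i] + sse(i, j)
                if cand < dp[k][j]:
                    dp[k][j], cut[k][j] = cand, i
    # Recover bucket boundaries by walking the cut table backwards.
    bounds, j = [], n
    for k in range(num_buckets, 0, -1):
        bounds.append(cut[k][j])
        j = cut[k][j]
    return dp[num_buckets][n], sorted(bounds)[1:]  # drop the leading 0 boundary


if __name__ == "__main__":
    err, boundaries = v_optimal_histogram([1, 1, 2, 9, 10, 10, 3, 3], num_buckets=3)
    print(err, boundaries)
```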
Approximation algorithms for wavelet transform coding of data streams
This paper addresses the problem of finding a B-term wavelet representation
of a given discrete function f whose distance from f is
minimized. The problem is well understood when we seek to minimize the
Euclidean distance between f and its representation. The first known algorithms
for finding provably approximate representations minimizing general
ℓp distances (including ℓ∞) under a wide variety of compactly supported
wavelet bases are presented in this paper. For the Haar basis, a polynomial
time approximation scheme is demonstrated. These algorithms are applicable in
the one-pass sublinear-space data stream model of computation. They generalize
naturally to multiple dimensions and weighted norms. Also presented are a
universal representation that provides a provable approximation guarantee under
all p-norms simultaneously, and the first approximation algorithms for
bit-budget versions of the problem, known as adaptive quantization. Further, it
is shown that the algorithms presented here can be used to select a basis from
a tree-structured dictionary of bases and find a B-term representation of the
given function that provably approximates its best dictionary-basis
representation.
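For orientation, the well-understood special case mentioned above can be made concrete: under the Euclidean (ℓ2) error, keeping the B largest orthonormal Haar coefficients is optimal. The sketch below shows only this ℓ2 baseline, not the paper's algorithms for general ℓp distances or the streaming setting; it assumes a power-of-two input length, and the names haar_transform and best_b_term_l2 are illustrative.

```python
import numpy as np

def haar_transform(signal):
    """Orthonormal Haar wavelet transform of a length-2^k vector."""
    coeffs = np.asarray(signal, dtype=float)
    out = []
    while len(coeffs) > 1:
        avg = (coeffs[0::2] + coeffs[1::2]) / np.sqrt(2.0)
        diff = (coeffs[0::2] - coeffs[1::2]) / np.sqrt(2.0)
        out.append(diff)
        coeffs = avg
    out.append(coeffs)                 # overall (scaled) average
    return np.concatenate(out[::-1])   # [average, coarse detail, ..., fine detail]

def inverse_haar(coeffs):
    """Invert haar_transform."""
    coeffs = np.asarray(coeffs, dtype=float)
    avg, pos = coeffs[:1], 1
    while pos < len(coeffs):
        diff = coeffs[pos:2 * pos]
        new = np.empty(2 * pos)
        new[0::2] = (avg + diff) / np.sqrt(2.0)
        new[1::2] = (avg - diff) / np.sqrt(2.0)
        avg, pos = new, 2 * pos
    return avg

def best_b_term_l2(signal, B):
    """Keep the B largest-magnitude Haar coefficients (optimal for l2 error only)."""
    c = haar_transform(signal)
    keep = np.argsort(np.abs(c))[-B:]
    sparse = np.zeros_like(c)
    sparse[keep] = c[keep]
    return inverse_haar(sparse)

if __name__ == "__main__":
    f = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
    print(best_b_term_l2(f, B=3))
```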
Accurate Data Approximation in Constrained Environments
Several data reduction techniques have been proposed recently as
methods for providing fast and fairly accurate answers to complex
queries over large quantities of data. Their use has been widespread,
due to the multiple benefits that they may offer in several
constrained environments and applications. Compressed data representations
require less space to store, less bandwidth to communicate and can
provide, due to their size, very fast response times to
queries. Sensor networks represent a typical constrained environment,
due to the limited processing, storage and battery capabilities of the
sensor nodes.
Large-scale sensor networks require tight data handling and data
dissemination techniques. Transmitting a full-resolution data
feed from each sensor back to the base-station is often prohibitive
due to (i) limited bandwidth that may not be sufficient to sustain a
continuous feed from all sensors and (ii) increased power consumption
due to the wireless multi-hop communication. In order to minimize the
volume of the transmitted data, we can apply two well-known data reduction
techniques: aggregation and approximation.
In this dissertation we propose novel data reduction techniques for
the transmission of measurements collected in sensor network
environments. We first study the problem of summarizing multi-valued
data feeds generated at a single sensor node, a step necessary for the
transmission of large amounts of historical information collected at
the node. The transmission of these measurements may either be
periodic (i.e., whenever a certain number of measurements has been
collected), or in response to a query from the base station. We then
also consider the approximate evaluation of aggregate
continuous queries. A continuous query is a query that runs
continuously until explicitly terminated by the user. These queries
can be used to obtain a live estimate of some (aggregated) quantity,
such as the total number of moving objects detected by the sensors.
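As a deliberately simplified illustration of the approximation idea (not the dissertation's actual summarization algorithms), the sketch below compresses a single sensor's measurement history into a fixed number of equal-width segments, each transmitted as a (start, length, mean) triple; all names and the piecewise-constant model are hypothetical choices.

```python
def summarize_feed(measurements, num_segments):
    """Compress a sensor's measurement history into num_segments (start, length, mean)
    triples: a simple piecewise-constant approximation to reduce transmitted volume."""
    n = len(measurements)
    segments = []
    for s in range(num_segments):
        lo = s * n // num_segments
        hi = (s + 1) * n // num_segments
        if lo == hi:
            continue
        chunk = measurements[lo:hi]
        segments.append((lo, hi - lo, sum(chunk) / len(chunk)))
    return segments

def reconstruct(segments, n):
    """Expand the transmitted summary back into a length-n approximate series."""
    approx = [0.0] * n
    for start, length, mean in segments:
        for i in range(start, start + length):
            approx[i] = mean
    return approx

if __name__ == "__main__":
    feed = [20.1, 20.3, 20.2, 25.9, 26.1, 26.0, 21.0, 20.8]
    summary = summarize_feed(feed, num_segments=3)
    print(summary)                     # 3 triples instead of 8 raw readings
    print(reconstruct(summary, len(feed)))
```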
QuickSel: Quick Selectivity Learning with Mixture Models
Estimating the selectivity of a query is a key step in almost any cost-based
query optimizer. Most of today's databases rely on histograms or samples that
are periodically refreshed by re-scanning the data as the underlying data
changes. Since frequent scans are costly, these statistics are often stale and
lead to poor selectivity estimates. As an alternative to scans, query-driven
histograms have been proposed, which refine the histograms based on the actual
selectivities of the observed queries. Unfortunately, these approaches are
either too costly to use in practice---i.e., require an exponential number of
buckets---or quickly lose their advantage as they observe more queries.
In this paper, we propose a selectivity learning framework, called QuickSel,
which falls into the query-driven paradigm but does not use histograms.
Instead, it builds an internal model of the underlying data, which can be
refined significantly faster (e.g., only 1.9 milliseconds for 300 queries).
This fast refinement allows QuickSel to continuously learn from each query and
yield increasingly more accurate selectivity estimates over time. Unlike
query-driven histograms, QuickSel relies on a mixture model and a new
optimization algorithm for training its model. Our extensive experiments on two
real-world datasets confirm that, given the same target accuracy, QuickSel is
34.0x-179.4x faster than state-of-the-art query-driven histograms, including
ISOMER and STHoles. Further, given the same space budget, QuickSel is
26.8%-91.8% more accurate than periodically-updated histograms and samples,
respectively.
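The following is a much-simplified, one-dimensional sketch of the query-driven idea, not QuickSel's actual mixture-model training: it fits fixed-bucket densities by least squares so that the model reproduces the selectivities observed for past range predicates. The function fit_bucket_model, the uniform bucketization, and the clip-and-renormalize step are assumptions made purely for illustration.

```python
import numpy as np

def fit_bucket_model(queries, domain=(0.0, 1.0), num_buckets=8):
    """queries: list of ((lo, hi), observed_selectivity) pairs over one numeric attribute.
    Fits non-negative bucket probabilities that best explain the observed selectivities."""
    d_lo, d_hi = domain
    edges = np.linspace(d_lo, d_hi, num_buckets + 1)
    A = np.zeros((len(queries), num_buckets))
    y = np.zeros(len(queries))
    for r, ((lo, hi), sel) in enumerate(queries):
        y[r] = sel
        for b in range(num_buckets):
            # Fraction of bucket b covered by the predicate [lo, hi).
            overlap = max(0.0, min(hi, edges[b + 1]) - max(lo, edges[b]))
            A[r, b] = overlap / (edges[b + 1] - edges[b])
    weights, *_ = np.linalg.lstsq(A, y, rcond=None)
    weights = np.clip(weights, 0.0, None)
    total = weights.sum()
    return edges, (weights / total if total > 0 else weights)

def estimate(edges, weights, lo, hi):
    """Estimate the selectivity of a new range predicate [lo, hi)."""
    est = 0.0
    for b, w in enumerate(weights):
        overlap = max(0.0, min(hi, edges[b + 1]) - max(lo, edges[b]))
        est += w * overlap / (edges[b + 1] - edges[b])
    return est

if __name__ == "__main__":
    observed = [((0.0, 0.5), 0.7), ((0.25, 0.75), 0.5), ((0.5, 1.0), 0.3)]
    edges, weights = fit_bucket_model(observed, num_buckets=4)
    print(round(estimate(edges, weights, 0.0, 0.25), 3))
```

Each new observed query adds one row to the system and refines the model, which is what lets a query-driven approach avoid rescanning the base data.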
Query estimation techniques in database systems
The effectiveness of query optimization in database systems critically depends on the system's ability to assess the execution costs of different query execution plans. For this purpose, the sizes and data distributions of the intermediate results generated during plan execution need to be estimated as accurately as possible. This estimation requires the maintenance of statistics on the data stored in the database, which are referred to as data synopses.
While the problem of query cost estimation has received significant attention for over a decade, it has remained an open issue in practice, because most previous techniques have focused on singular aspects of the problem such as minimizing the estimation error of a single type of query and a single data distribution, whereas database management systems generally need to support
a wide range of queries over a number of datasets.
In this thesis I introduce a new technique for query result estimation, which extends existing techniques in that it offers estimation for all combinations of the three major database operators selection, projection, and join. The approach is based on separate and independent approximations of the attribute values contained in a dataset and their frequencies. Through the use of space-filling curves, the approach extends to multi-dimensional data, while maintaining its accuracy and computational properties. The resulting estimation accuracy is competitive with specialized techniques and superior to the histogram techniques currently implemented in commercial database management systems.
Because data synopses reside in main memory, they compete for available space with the database cache and query execution buffers. Consequently, the memory available to data synopses needs to be used efficiently. This results in a physical design problem for data synopses, which is to determine the best set of synopses for a given combination of datasets, queries, and available
memory. This thesis introduces a formalization of the problem, and efficient algorithmic solutions.
All discussed techniques are evaluated with regard to their overhead and resulting estimation accuracy on a variety of synthetic and real-life datasets.
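How a space-filling curve lets a one-dimensional synopsis carry over to multidimensional data can be illustrated with a Z-order (Morton) mapping, sketched below for small non-negative integer coordinates; the thesis does not necessarily use this particular curve or encoding, so treat the function z_order_key as a hypothetical example.

```python
def z_order_key(coords, bits=16):
    """Interleave the bits of the coordinates into a single Z-order (Morton) key,
    so that points close in the multidimensional space tend to stay close in 1-D."""
    key = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return key

if __name__ == "__main__":
    points = [(3, 5), (3, 6), (12, 1), (2, 5)]
    # Sorting by the Morton key linearizes the 2-D points; a 1-D synopsis
    # (histogram, wavelet, ...) can then be built over this ordering.
    for p in sorted(points, key=z_order_key):
        print(p, bin(z_order_key(p)))
```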
Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams
Today's large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable costs. The root cause is the combinatorial explosion of data subpopulations and the diversity of summary statistics we need to monitor simultaneously. We present Hydra, an efficient framework for multidimensional analytics that combines a "sketch of sketches", to avoid the overhead of monitoring exponentially many subpopulations, with universal sketching, to ensure accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra achieves robust error bounds and is an order of magnitude more efficient in terms of operational cost and memory footprint than existing frameworks (e.g., Spark, Druid), while ensuring interactive estimation times.
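As a minimal illustration of the sketching building block only (a plain Count-Min sketch, not Hydra's universal "sketch of sketches" construction), the snippet below estimates per-key counts in memory that is independent of the number of distinct keys; class and parameter names are illustrative.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: approximate per-key counts in O(width * depth) memory.
    Illustrates the sketching idea only, not Hydra's actual data structures."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # Each row overestimates due to hash collisions; the minimum is the tightest bound.
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))

if __name__ == "__main__":
    cms = CountMinSketch()
    for flow in ["a", "b", "a", "c", "a", "b"]:
        cms.add(flow)
    print(cms.estimate("a"), cms.estimate("b"), cms.estimate("d"))
```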
Mining complex data in highly streaming environments
Data is growing at a rapid rate because of advanced hardware and software technologies and platforms such as e-health systems, sensor networks, and social media. One of the challenging problems is storing, processing and transferring this big data in an efficient and effective way. One solution to tackle these challenges is to construct synopses by means of data summarization techniques. Motivated by the fact that without summarization, processing, analyzing and communicating this vast amount of data is inefficient, this thesis introduces new summarization frameworks with the main goals of reducing communication costs and accelerating data mining processes in different application scenarios. Specifically, we study the following big data summarization techniques: (i) dimensionality reduction; (ii) clustering; and (iii) histograms, considering their importance and wide use in various areas and domains. In our work, we propose three different frameworks using these summarization techniques to cover three different aspects of big data, "Volume", "Velocity" and "Variety", in centralized and decentralized platforms. We use dimensionality reduction techniques for summarizing large 2D-arrays, and clustering and histograms for processing multiple data streams.
With respect to the importance and rapid growth of emerging e-health applications such as tele-radiology and tele-medicine, which require fast, low-cost, and often lossless access to massive amounts of medical images and data over band-limited channels, our first framework attempts to summarize streams of large-volume medical images (e.g. X-rays) for the purpose of compression. Significant amounts of correlation and redundancy exist across different medical images. These can be extracted and used as a data summary to achieve better compression, and consequently less storage and less communication overhead on the network. We propose a novel memory-assisted compression framework as a learning-based universal coding scheme, which can be used to complement any existing algorithm to further eliminate redundancies/similarities across images. This approach is motivated by the fact that, often in medical applications, massive amounts of correlated images from the same family are available as training data for learning the dependencies and deriving appropriate reference or synopsis models. The models can then be used for compression of any new image from the same family. In particular, dimensionality reduction techniques such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) are applied on a set of images from the training data to form the required reference models. The proposed memory-assisted compression allows each image to be processed independently of other images, and hence allows individual image access and transmission.
In the second part of our work, we investigate the problem of summarizing distributed multidimensional data streams using clustering. We devise a distributed clustering framework, DistClusTree, that extends the centralized ClusTree approach. The main difficulty in distributed clustering is balancing communication costs and clustering quality. We tackle this in DistClusTree by combining spatial index summaries and online tracking for efficient local and global incremental clustering. We demonstrate through extensive experiments the efficacy of the framework in terms of communication costs and approximate clustering quality.
In the last part, we use a multidimensional index structure to merge distributed summaries into a centralized histogram, another widely used summarization technique, with application to approximate range query answering. We propose the index-based Distributed Mergeable Summaries (iDMS) framework, based on kd-trees, which addresses these challenges with generative data models: Gaussian mixture models (GMMs) and a Generative Adversarial Network (GAN). iDMS maintains a global approximate kd-tree at a central site via GMMs or GANs as new streaming data arrives at local sites. Experimental results validate the effectiveness and efficiency of iDMS against baseline distributed settings in terms of approximation error and communication costs.
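As a toy sketch of the generative-model idea behind distributed mergeable summaries (not the iDMS implementation, which additionally maintains a kd-tree and can use GANs), the snippet below has each local site fit a small GMM to its data batch and ship only the model parameters; the central site then answers an approximate range-count query by sampling from the received models. Function names, the use of scikit-learn's GaussianMixture, and the sampling-based estimator are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def local_summary(batch, n_components=3, seed=0):
    """Fit a GMM to one site's 2-D data batch and return only its parameters."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(batch)
    return gmm.weights_, gmm.means_, gmm.covariances_

def approximate_range_count(summaries, site_sizes, lo, hi, samples_per_site=2000, seed=0):
    """Estimate how many tuples across all sites fall in the axis-aligned box [lo, hi],
    using only the shipped GMM parameters (no raw data leaves the sites)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for (weights, means, covs), size in zip(summaries, site_sizes):
        comps = rng.choice(len(weights), size=samples_per_site, p=weights / weights.sum())
        pts = np.array([rng.multivariate_normal(means[c], covs[c]) for c in comps])
        frac = np.all((pts >= lo) & (pts <= hi), axis=1).mean()
        total += frac * size
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    site_a = rng.normal([0.0, 0.0], 1.0, size=(5000, 2))   # data at local site A
    site_b = rng.normal([5.0, 5.0], 2.0, size=(8000, 2))   # data at local site B
    summaries = [local_summary(site_a), local_summary(site_b)]
    est = approximate_range_count(summaries, [len(site_a), len(site_b)],
                                  lo=np.array([-1.0, -1.0]), hi=np.array([1.0, 1.0]))
    print(round(est))   # compare against the exact count over site_a and site_b
```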