Accurate Data Approximation in Constrained Environments
Several data reduction techniques have been proposed recently as
methods for providing fast and fairly accurate answers to complex
queries over large quantities of data. Their use has been widespread,
due to the multiple benefits that they may offer in several
constrained environments and applications. Compressed data representations
require less space to store, less bandwidth to communicate and can
provide, due to their size, very fast response times to
queries. Sensor networks represent a typical constrained environment,
due to the limited processing, storage and battery capabilities of the
sensor nodes.
Large-scale sensor networks require tight data handling and data
dissemination techniques. Transmitting a full-resolution data
feed from each sensor back to the base-station is often prohibitive
due to (i) limited bandwidth that may not be sufficient to sustain a
continuous feed from all sensors and (ii) increased power consumption
due to the wireless multi-hop communication. In order to minimize the
volume of the transmitted data, we can apply two well-known data reduction
techniques: aggregation and approximation.
In this dissertation we propose novel data reduction techniques for
the transmission of measurements collected in sensor network
environments. We first study the problem of summarizing multi-valued
data feeds generated at a single sensor node, a step necessary for the
transmission of large amounts of historical information collected at
the node. The transmission of these measurements may either be
periodic (i.e., triggered when a certain number of measurements has been
collected), or in response to a query from the base station. We then
also consider the approximate evaluation of aggregate
continuous queries. A continuous query is a query that runs
continuously until explicitly terminated by the user. These queries
can be used to obtain a live-estimate of some (aggregated) quantity,
such as the total number of moving objects detected by the sensors.
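The contrast between the two reduction styles can be illustrated with a small sketch (illustrative only; the readings, window size, and subsampling step are invented for the example):

```python
# Illustrative contrast between the two reduction styles on one
# window of (invented) sensor readings.
readings = [21.4, 21.5, 21.9, 22.3, 22.2, 21.8, 21.7, 21.6]

# Aggregation: collapse the whole window into a single summary value.
aggregate = sum(readings) / len(readings)

# Approximation: transmit a coarser representation of the feed, here
# every k-th sample, from which the base station can interpolate.
k = 4
approximation = readings[::k]

print(round(aggregate, 2))  # one number replaces the window
print(approximation)        # a shorter feed replaces the full feed
```

Aggregation loses the shape of the signal but is maximally compact; approximation keeps a lossy version of the shape at a higher transmission cost.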
Approximation algorithms for wavelet transform coding of data streams
This paper addresses the problem of finding a B-term wavelet representation
of a given discrete function f whose distance from f is
minimized. The problem is well understood when we seek to minimize the
Euclidean distance between f and its representation. The first known algorithms
for finding provably approximate representations minimizing general
ℓp distances (including ℓ∞) under a wide variety of compactly supported
wavelet bases are presented in this paper. For the Haar basis, a polynomial
time approximation scheme is demonstrated. These algorithms are applicable in
the one-pass sublinear-space data stream model of computation. They generalize
naturally to multiple dimensions and weighted norms. Also presented are a
universal representation that provides a provable approximation guarantee
under all p-norms simultaneously, and the first approximation algorithms for
bit-budget versions of the problem, known as adaptive quantization. Further, it
is shown that the algorithms presented here can be used to select a basis from
a tree-structured dictionary of bases and find a B-term representation of the
given function that provably approximates its best dictionary-basis
representation.
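For the Euclidean case, which the paper notes is well understood, the optimal B-term synopsis can be read off directly from the transform; the following toy sketch (invented signal, not the paper's algorithms) shows why, and why general ℓp distances are harder:

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar transform of a length-2^k signal."""
    x = np.asarray(x, dtype=float)
    detail = []
    while x.size > 1:
        detail.append((x[0::2] - x[1::2]) / np.sqrt(2.0))
        x = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    return np.concatenate([x] + detail[::-1])

signal = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
coeffs = haar_transform(signal)

# For Euclidean error the optimal B-term synopsis keeps the B largest
# coefficients in magnitude (Parseval's theorem); under other l_p
# distances this greedy choice is no longer optimal, which is what
# makes the problem studied in the paper nontrivial.
B = 3
keep = np.argsort(np.abs(coeffs))[-B:]
synopsis = np.zeros_like(coeffs)
synopsis[keep] = coeffs[keep]

# By orthonormality, the l2 error equals the norm of the dropped part.
dropped = np.delete(coeffs, keep)
print(round(float(np.linalg.norm(dropped)), 4))
```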
Sample Footprints für Data-Warehouse-Datenbanken
With the amount of data in current data warehouse databases growing steadily, random sampling is continuously gaining in importance. In particular, interactive analyses of large datasets can greatly benefit from the significantly shorter response times of approximate query processing.
In this scenario, Linked Bernoulli Synopses provide memory-efficient schema-level synopses, i.e., synopses that consist of random samples of each table in the schema with minimal overhead for retaining foreign-key integrity within the synopsis. This provides efficient support for the approximate answering of queries with arbitrary foreign-key joins. In this article, we focus on the application of Linked Bernoulli Synopses in data warehouse environments. On the one hand, we analyze the instantiation of memory-bounded synopses. Among others, we address the following questions: How can the given space be partitioned among the individual samples? What is the impact on the overhead? On the other hand, we consider further adaptations of Linked Bernoulli Synopses for usage in data warehouse databases. We show how synopses can incrementally be kept up-to-date when the underlying data changes. Further, we suggest additional outlier handling methods to reduce the estimation error of approximate answers of aggregation queries with foreign-key joins. With a variety of experiments, we show that Linked Bernoulli Synopses and the proposed techniques have great potential in the context of data warehouse databases.
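A minimal sketch of the underlying idea (the schema, rates, and sampling code are hypothetical; this naive version omits the coordinated per-table coin flips that make Linked Bernoulli Synopses space-efficient):

```python
import random

# Hypothetical toy schema: a fact table referencing a dimension table.
dimension = {1: "Berlin", 2: "Dresden", 3: "Leipzig"}
fact = [(101, 1), (102, 1), (103, 2), (104, 3), (105, 2)]  # (order_id, city_fk)

random.seed(7)
q = 0.5  # Bernoulli sampling rate for the fact table

# Naive schema-level synopsis: Bernoulli-sample the fact table, then add
# every referenced dimension row so that foreign-key joins evaluated
# inside the synopsis never dangle. Linked Bernoulli Synopses refine
# exactly this step by correlating the coin flips across tables to
# minimize the extra storage for referential integrity.
fact_sample = [row for row in fact if random.random() < q]
dim_sample = {fk: dimension[fk] for _, fk in fact_sample}

# Any foreign-key join restricted to the synopsis is now complete.
joined = [(oid, dim_sample[fk]) for oid, fk in fact_sample]
print(len(fact_sample), len(dim_sample))
```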
PolyFit: Polynomial-based Indexing Approach for Fast Approximate Range Aggregate Queries
Range aggregate queries find frequent application in data analytics. In some
use cases, approximate results are preferred over accurate results if they can
be computed rapidly and satisfy approximation guarantees. Inspired by a recent
indexing approach, we provide means of representing a discrete point data set
by continuous functions that can then serve as compact index structures. More
specifically, we develop a polynomial-based indexing approach, called PolyFit,
for processing approximate range aggregate queries. PolyFit is capable of
supporting multiple types of range aggregate queries, including COUNT, SUM, MIN
and MAX aggregates, with guaranteed absolute and relative error bounds.
Experiment results show that PolyFit is faster, more accurate, and more
compact than existing learned index structures.
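The core idea can be sketched as follows (an illustrative toy version with invented data, not PolyFit's actual construction: it fits one global polynomial to the empirical cumulative count and answers COUNT queries by differencing the fitted function):

```python
import numpy as np
from numpy.polynomial import Polynomial

# Invented point data: 10,000 keys drawn uniformly from [0, 100].
rng = np.random.default_rng(0)
keys = np.sort(rng.uniform(0.0, 100.0, size=10_000))
cdf = np.arange(1, keys.size + 1)  # F(keys[i]) = number of keys <= keys[i]

# The continuous function replacing the data: a low-degree polynomial
# fitted to the cumulative count. Only its coefficients are stored.
F = Polynomial.fit(keys, cdf, deg=5)

def approx_count(a, b):
    """Approximate COUNT of keys in (a, b] from the fitted polynomial."""
    return float(F(b) - F(a))

exact = int(np.searchsorted(keys, 60.0, side="right")
            - np.searchsorted(keys, 20.0, side="right"))
print(exact, round(approx_count(20.0, 60.0)))
```

The same differencing trick extends to SUM by fitting the cumulative sum instead of the cumulative count; bounding the worst-case fitting error is what yields the absolute and relative guarantees the paper targets.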
SciBORQ: Scientific data management with Bounds On Runtime and Quality
Data warehouses underlying virtual observatories stress the capabilities of database management systems in many ways. They are filled, on a daily basis, with large amounts of factual information derived from intensive data scrubbing and computational feature extraction pipelines. The predominant data processing techniques focus on parallel loads and map-reduce feature extraction algorithms. Querying these huge databases requires a sizable computing cluster, while ideally the initial investigation should run interactively, using as few resources as possible. In this paper, we explore a different route, one based on the observation that at any given time only a fraction of the data is of primary value for a specific task. This fraction becomes the focus of scientific reflection through an iterative process of ad-hoc query refinement. Steering through data to facilitate scientific discovery demands guarantees for the query execution time. In addition, strict bounds on errors are required to satisfy the demands of scientific use, such that query results can be used to test hypotheses reliably. We propose SciBORQ, a framework for scientific data exploration that gives precise control over runtime and quality of query answering. We present novel techniques to derive multiple interesting data samples, called impressions. An impression is selected such that the statistical error of a query answer remains low, while the result can be computed within strict time bounds. Impressions differ from previous sampling approaches in their bias towards the focal point of the scientific data exploration, their multi-layer design, and their adaptiveness to shifting query workloads. The ultimate goal is a complete system for scientific data exploration and discovery, capable of producing quality answers with strict error bounds in pre-defined time frames.
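The runtime/quality trade-off at the heart of such frameworks can be illustrated with plain uniform sampling (a toy sketch with invented data and budget, not SciBORQ's biased, multi-layer impressions):

```python
import random
import math

# Invented population standing in for a large science table.
random.seed(1)
population = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# The time bound caps how many rows we may scan; the sample size is
# chosen to fit that budget, and the error bound is reported with
# the approximate answer.
rows_per_budget = 2_000
sample = random.sample(population, rows_per_budget)

mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)
ci95 = 1.96 * math.sqrt(var / len(sample))  # normal-approximation CI

print(f"avg ~ {mean:.2f} +/- {ci95:.2f}")
```

Scanning 2% of the rows meets a 50x tighter time budget while the reported confidence interval tells the scientist whether the answer is precise enough to test a hypothesis; impressions improve on this by biasing the sample toward the current focus of exploration.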
Histograms and Wavelets on Probabilistic Data
There is a growing realization that uncertain information is a first-class
citizen in modern database management. As such, we need techniques to correctly
and efficiently process uncertain data in database systems. In particular, data
reduction techniques that can produce concise, accurate synopses of large
probabilistic relations are crucial. Similar to their deterministic relation
counterparts, such compact probabilistic data synopses can form the foundation
for human understanding and interactive data exploration, probabilistic query
planning and optimization, and fast approximate query processing in
probabilistic database systems.
In this paper, we introduce definitions and algorithms for building
histogram- and wavelet-based synopses on probabilistic data. The core problem
is to choose a set of histogram bucket boundaries or wavelet coefficients to
optimize the accuracy of the approximate representation of a collection of
probabilistic tuples under a given error metric. For a variety of different
error metrics, we devise efficient algorithms that construct optimal or near
optimal B-term histogram and wavelet synopses. This requires careful analysis
of the structure of the probability distributions, and novel extensions of
known dynamic-programming-based techniques for the deterministic domain. Our
experiments show that this approach clearly outperforms simple ideas, such as
building summaries for samples drawn from the data distribution, while taking
equal or less time.
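For deterministic data, the dynamic-programming baseline that the paper generalizes looks like this (a standard V-optimal histogram under sum-squared error; the input is invented):

```python
# Classic O(n^2 B) dynamic program for an optimal B-bucket histogram
# under sum-squared error on deterministic data -- the kind of routine
# the paper extends to collections of probabilistic tuples.
def v_optimal_histogram(data, B):
    n = len(data)
    # Prefix sums give O(1) squared error for any bucket data[i:j].
    p = [0.0] * (n + 1)
    pp = [0.0] * (n + 1)
    for i, v in enumerate(data):
        p[i + 1] = p[i] + v
        pp[i + 1] = pp[i] + v * v

    def sse(i, j):  # error of one bucket approximating data[i:j] by its mean
        s = p[j] - p[i]
        return (pp[j] - pp[i]) - s * s / (j - i)

    INF = float("inf")
    # cost[k][j]: best error for the prefix data[:j] using k buckets.
    cost = [[INF] * (n + 1) for _ in range(B + 1)]
    cost[0][0] = 0.0
    for k in range(1, B + 1):
        for j in range(1, n + 1):
            cost[k][j] = min(cost[k - 1][i] + sse(i, j) for i in range(j))
    return cost[B][n]

data = [1, 1, 1, 9, 9, 9, 5, 5]
print(v_optimal_histogram(data, 3))  # → 0.0 (three perfect buckets)
```

In the probabilistic setting each tuple contributes a distribution rather than a value, so the bucket-error term `sse` must be re-derived per error metric, which is where the paper's analysis lies.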
Algorithms for Big Data: Graphs and PageRank
This work studies a set of techniques and strategies for algorithm design
aimed at solving problems over massive data sets efficiently, a field known
as Algorithms for Big Data. In particular, it studies streaming algorithms,
which form the basis of the sublinear-space data structures known as
sketches, and examines in depth problems on graphs in the semi-streaming
model. The PageRank algorithm is then analyzed as a concrete case study.
Finally, development has begun on a library for solving graph problems,
implemented on top of the intensive mathematical computation platform
TensorFlow.
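As a reference point for the case study, a compact in-memory PageRank power iteration might look like this (illustrative only; the streaming and semi-streaming settings studied in the work constrain how the edge list may be accessed):

```python
# Plain PageRank power iteration over an edge list. In the
# semi-streaming model the edge list would be read in passes rather
# than held in memory; the update rule is the same.
def pagerank(edges, n, d=0.85, iters=50):
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1

    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        for u, v in edges:
            new[v] += d * rank[u] / out_deg[u]
        # Redistribute the mass of dangling (out-degree 0) nodes.
        dangling = sum(rank[u] for u in range(n) if out_deg[u] == 0)
        rank = [r + d * dangling / n for r in new]
    return rank

# Invented 3-node graph: node 1 has the most in-links.
edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
r = pagerank(edges, 3)
print([round(x, 3) for x in r])
```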
Mining complex data in highly streaming environments
Data is growing at a rapid rate because of advanced hardware and software technologies and platforms such as e-health systems, sensor networks, and social media. One of the challenging problems is storing, processing and transferring this big data in an efficient and effective way. One solution to tackle these challenges is to construct synopses by means of data summarization techniques. Motivated by the fact that without summarization, processing, analyzing and communicating this vast amount of data is inefficient, this thesis introduces new summarization frameworks with the main goals of reducing communication costs and accelerating data mining processes in different application scenarios. Specifically, we study the following big data summarization techniques: (i) dimensionality reduction; (ii) clustering; and (iii) histograms, considering their importance and wide use in various areas and domains. In our work, we propose three different frameworks using these summarization techniques to cover three different aspects of big data: "Volume", "Velocity" and "Variety", in centralized and decentralized platforms. We use dimensionality reduction techniques for summarizing large 2D-arrays, and clustering and histograms for processing multiple data streams. With respect to the importance and rapid growth of emerging e-health applications such as tele-radiology and tele-medicine, which require fast, low-cost, and often lossless access to massive amounts of medical images and data over band-limited channels, our first framework attempts to summarize streams of large-volume medical images (e.g. X-rays) for the purpose of compression. Significant amounts of correlation and redundancy exist across different medical images. These can be extracted and used as a data summary to achieve better compression, and consequently less storage and less communication overhead on the network.
We propose a novel memory-assisted compression framework as a learning-based universal coding, which can be used to complement any existing algorithm to further eliminate redundancies/similarities across images. This approach is motivated by the fact that, often in medical applications, massive amounts of correlated images from the same family are available as training data for learning the dependencies and deriving appropriate reference or synopsis models. The models can then be used for compression of any new image from the same family. In particular, dimensionality reduction techniques such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) are applied on a set of images from training data to form the required reference models. The proposed memory-assisted compression allows each image to be processed independently of other images, and hence allows individual image access and transmission. In the second part of our work, we investigate the problem of summarizing distributed multidimensional data streams using clustering. We devise a distributed clustering framework, DistClusTree, that extends the centralized ClusTree approach. The main difficulty in distributed clustering is balancing communication costs and clustering quality. We tackle this in DistClusTree through combining spatial index summaries and online tracking for efficient local and global incremental clustering. We demonstrate through extensive experiments the efficacy of the framework in terms of communication costs and approximate clustering quality. In the last part, we use a multidimensional index structure to merge distributed summaries in the form of a centralized histogram as another widely used summarization technique with the application in approximate range query answering.
In this thesis, we propose the index-based Distributed Mergeable Summaries (iDMS) framework based on kd-trees that addresses these challenges with data generative models of Gaussian mixture models (GMMs) and a Generative Adversarial Network (GAN). iDMS maintains a global approximate kd-tree at a central site via GMMs or GANs upon new arrivals of streaming data at local sites. Experimental results validate the effectiveness and efficiency of iDMS against baseline distributed settings in terms of approximation error and communication costs.
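The "mergeable summaries" idea underlying such frameworks can be sketched with plain equi-width bucket counts (a toy stand-in with invented sites and values, not iDMS's kd-tree/GMM machinery):

```python
# Each site summarizes its local stream as bucket counts over a shared
# domain; the coordinator merges summaries by addition and answers
# approximate range-COUNT queries from the merged synopsis alone.
DOMAIN, BUCKETS = (0.0, 100.0), 20
WIDTH = (DOMAIN[1] - DOMAIN[0]) / BUCKETS

def summarize(stream):
    counts = [0] * BUCKETS
    for x in stream:
        counts[min(int((x - DOMAIN[0]) / WIDTH), BUCKETS - 1)] += 1
    return counts

def merge(*summaries):
    # Mergeability: combining summaries never loses more information
    # than building one summary over the concatenated streams.
    return [sum(col) for col in zip(*summaries)]

site_a = summarize([5.0, 12.0, 47.0, 47.5])
site_b = summarize([11.0, 48.0, 99.9])
merged = merge(site_a, site_b)
print(sum(merged))  # → 7 (total stream size is preserved)
```

Replacing the fixed grid with an adaptive kd-tree, and the raw counts with fitted generative models, is what lets a framework like iDMS trade communication cost against approximation error.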