Histograms and Wavelets on Probabilistic Data
There is a growing realization that uncertain information is a first-class
citizen in modern database management. As such, we need techniques to correctly
and efficiently process uncertain data in database systems. In particular, data
reduction techniques that can produce concise, accurate synopses of large
probabilistic relations are crucial. Similar to their deterministic relation
counterparts, such compact probabilistic data synopses can form the foundation
for human understanding and interactive data exploration, probabilistic query
planning and optimization, and fast approximate query processing in
probabilistic database systems.
In this paper, we introduce definitions and algorithms for building
histogram- and wavelet-based synopses on probabilistic data. The core problem
is to choose a set of histogram bucket boundaries or wavelet coefficients to
optimize the accuracy of the approximate representation of a collection of
probabilistic tuples under a given error metric. For a variety of different
error metrics, we devise efficient algorithms that construct optimal or
near-optimal B-term histogram and wavelet synopses. This requires careful analysis
of the structure of the probability distributions, and novel extensions of
known dynamic-programming-based techniques for the deterministic domain. Our
experiments show that this approach clearly outperforms simple ideas, such as
building summaries for samples drawn from the data distribution, while taking
equal or less time.
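The abstract refers to known dynamic-programming techniques from the deterministic domain. As a point of reference, the following is a minimal sketch of the classic V-optimal histogram dynamic program on ordinary (non-probabilistic) frequencies under sum-squared error. It is not the paper's probabilistic algorithm; the function name `v_optimal_histogram` and its signature are assumptions chosen for illustration.

```python
from typing import List, Tuple


def v_optimal_histogram(freqs: List[float], num_buckets: int) -> Tuple[float, List[int]]:
    """Return (minimum total squared error, sorted bucket start indices).

    Deterministic baseline only: each bucket is represented by its mean.
    """
    n = len(freqs)
    if not 1 <= num_buckets <= n:
        raise ValueError("num_buckets must be between 1 and len(freqs)")

    # Prefix sums of values and squared values give O(1) bucket error.
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, f in enumerate(freqs):
        prefix[i + 1] = prefix[i] + f
        prefix_sq[i + 1] = prefix_sq[i] + f * f

    def sse(lo: int, hi: int) -> float:
        """Squared error of representing freqs[lo:hi] by its mean."""
        total = prefix[hi] - prefix[lo]
        return (prefix_sq[hi] - prefix_sq[lo]) - total * total / (hi - lo)

    inf = float("inf")
    # best[b][i]: minimum error covering the first i values with b buckets.
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for i in range(b, n + 1):
            for j in range(b - 1, i):
                cost = best[b - 1][j] + sse(j, i)
                if cost < best[b][i]:
                    best[b][i] = cost
                    cut[b][i] = j

    # Walk the stored cut points backwards to recover bucket starts.
    starts: List[int] = []
    i = n
    for b in range(num_buckets, 0, -1):
        i = cut[b][i]
        starts.append(i)
    starts.reverse()
    return best[num_buckets][n], starts
```

For example, `v_optimal_histogram([1, 1, 1, 5, 5, 5], 2)` splits the input at index 3. The triple loop costs O(B n^2) time, which is the usual cost of this baseline.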
Accurate Data Approximation in Constrained Environments
Several data reduction techniques have been proposed recently as
methods for providing fast and fairly accurate answers to complex
queries over large quantities of data. Their use has been widespread,
due to the multiple benefits that they may offer in several
constrained environments and applications. Compressed data representations
require less space to store, less bandwidth to communicate and can
provide, due to their size, very fast response times to
queries. Sensor networks represent a typical constrained environment,
due to the limited processing, storage and battery capabilities of the
sensor nodes.
Large-scale sensor networks require tight data handling and data
dissemination techniques. Transmitting a full-resolution data
feed from each sensor back to the base-station is often prohibitive
due to (i) limited bandwidth that may not be sufficient to sustain a
continuous feed from all sensors and (ii) increased power consumption
due to the wireless multi-hop communication. In order to minimize the
volume of the transmitted data, we can apply two well-known data reduction
techniques: aggregation and approximation.
In this dissertation we propose novel data reduction techniques for
the transmission of measurements collected in sensor network
environments. We first study the problem of summarizing multi-valued
data feeds generated at a single sensor node, a step necessary for the
transmission of large amounts of historical information collected at
the node. The transmission of these measurements may either be
periodic (i.e., when a certain number of measurements has been
collected), or in response to a query from the base station. We then
also consider the approximate evaluation of aggregate
continuous queries. A continuous query is a query that runs
continuously until explicitly terminated by the user. These queries
can be used to obtain a live-estimate of some (aggregated) quantity,
such as the total number of moving objects detected by the sensors
Fractal image compression and the self-affinity assumption: a stochastic signal modelling perspective
Bibliography: p. 208-225.
Fractal image compression is a comparatively new technique which has gained considerable attention in the popular technical press, and more recently in the research literature. The most significant advantages claimed are high reconstruction quality at low coding rates, rapid decoding, and "resolution independence" in the sense that an encoded image may be decoded at a higher resolution than the original. While many of the claims published in the popular technical press are clearly extravagant, it appears from the rapidly growing body of published research that fractal image compression is capable of performance comparable with that of other techniques enjoying the benefit of a considerably more robust theoretical foundation.
So called because of the similarities between the form of image representation and a mechanism widely used in generating deterministic fractal images, fractal compression represents an image by the parameters of a set of affine transforms on image blocks under which the image is approximately invariant. Although the conditions imposed on these transforms may be shown to be sufficient to guarantee that an approximation of the original image can be reconstructed, there is no obvious theoretical reason to expect this to be an efficient representation for image coding purposes. The usual analogy with vector quantisation, in which each image is considered to be represented in terms of code vectors extracted from the image itself, is instructive, but transforms the fundamental problem into one of understanding why this construction results in an efficient codebook. The signal property required for such a codebook to be effective, termed "self-affinity", is poorly understood. An examination of this property based on a stochastic signal model is the primary contribution of this dissertation.
The most significant findings (subject to some important restrictions) are that "self-affinity" is not a natural consequence of common statistical assumptions but requires particular conditions which are inadequately characterised by second order statistics, and that "natural" images are only marginally "self-affine", to the extent that fractal image compression is effective, but not more so than comparable standard vector quantisation techniques.
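To make the idea of an affine transform on image blocks concrete, here is a minimal sketch of the least-squares fit of a grey-level scale and offset that maps one block (the "domain" block) onto another (the "range" block). It is an illustration under stated assumptions, not the dissertation's method: blocks are flattened to equal-length sequences, and the name `fit_affine` is hypothetical. Practical coders typically also shrink the domain block spatially and restrict the scale to keep the transform contractive, which this sketch omits.

```python
from typing import Sequence, Tuple


def fit_affine(domain: Sequence[float], rng: Sequence[float]) -> Tuple[float, float, float]:
    """Return (scale, offset, squared_error) minimising
    sum((scale * d + offset - r) ** 2) over paired pixels d, r."""
    n = len(domain)
    if n == 0 or n != len(rng):
        raise ValueError("blocks must be non-empty and of equal length")

    mean_d = sum(domain) / n
    mean_r = sum(rng) / n
    var_d = sum((d - mean_d) ** 2 for d in domain)
    cov = sum((d - mean_d) * (r - mean_r) for d, r in zip(domain, rng))

    # A flat domain block carries no shape information: use offset only.
    scale = cov / var_d if var_d > 0 else 0.0
    offset = mean_r - scale * mean_d
    error = sum((scale * d + offset - r) ** 2 for d, r in zip(domain, rng))
    return scale, offset, error
```

A block is "self-affine" in this limited sense when some other block of the same image yields a small `squared_error`; an encoder would search over candidate domain blocks and store the best `(scale, offset)` pair for each range block.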
Building Wavelet Histograms on Large Data in MapReduce
MapReduce is becoming the de facto framework for storing and processing
massive data, due to its excellent scalability, reliability, and elasticity. In
many MapReduce applications, obtaining a compact accurate summary of data is
essential. Among various data summarization tools, histograms have proven to be
particularly important and useful for summarizing data, and the wavelet
histogram is one of the most widely used histograms. In this paper, we
investigate the problem of building wavelet histograms efficiently on large
datasets in MapReduce. We measure the efficiency of the algorithms by both
end-to-end running time and communication cost. We demonstrate that straightforward
adaptations of existing exact and approximate methods for building wavelet
histograms to MapReduce clusters are highly inefficient. To that end, we design
new algorithms for computing exact and approximate wavelet histograms and
discuss their implementation in MapReduce. We illustrate our techniques in
Hadoop, and compare to baseline solutions with extensive experiments performed
in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic
datasets, up to hundreds of gigabytes. The results suggest significant (often
orders of magnitude) performance improvement achieved by our new algorithms.
Comment: VLDB201
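For readers unfamiliar with wavelet histograms, the following is a minimal single-machine sketch of the underlying idea: compute a Haar decomposition of a frequency vector and keep only the B most significant coefficients. It does not reproduce the paper's MapReduce algorithms; the function names are assumptions, the input length is assumed to be a power of two, and coefficients are ranked by magnitude weighted by the square root of their support, a common choice when the target is squared error.

```python
import math
from typing import Dict, List


def haar_transform(values: List[float]) -> List[float]:
    """Unnormalised Haar decomposition (pairwise averages and half-differences)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = [0.0] * n
    current = list(values)
    while len(current) > 1:
        half = len(current) // 2
        averages = [(current[2 * i] + current[2 * i + 1]) / 2 for i in range(half)]
        details = [(current[2 * i] - current[2 * i + 1]) / 2 for i in range(half)]
        coeffs[half:2 * half] = details  # details of this level
        current = averages
    coeffs[0] = current[0]  # overall average
    return coeffs


def top_b_coefficients(coeffs: List[float], b: int) -> Dict[int, float]:
    """Keep the b coefficients with the largest support-weighted magnitude."""
    n = len(coeffs)

    def weight(k: int) -> float:
        level = 0 if k == 0 else k.bit_length() - 1
        support = n >> level  # number of values the coefficient spans
        return abs(coeffs[k]) * math.sqrt(support)

    keep = sorted(range(n), key=weight, reverse=True)[:b]
    return {k: coeffs[k] for k in sorted(keep)}


def reconstruct(kept: Dict[int, float], n: int) -> List[float]:
    """Invert the transform, treating dropped coefficients as zero."""
    coeffs = [kept.get(k, 0.0) for k in range(n)]
    current = [coeffs[0]]
    while len(current) < n:
        half = len(current)
        nxt: List[float] = []
        for avg, det in zip(current, coeffs[half:2 * half]):
            nxt.extend((avg + det, avg - det))
        current = nxt
    return current
```

Calling `reconstruct(top_b_coefficients(haar_transform(v), b), len(v))` gives the B-term approximation of `v`; with `b == len(v)` it returns the original values.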
Surface radiation budget for climate applications
The Surface Radiation Budget (SRB) consists of the upwelling and downwelling radiation fluxes at the surface, separately determined for the broadband shortwave (SW) (0 to 5 microns) and longwave (LW) (greater than 5 microns) spectral regions, plus certain key parameters that control these fluxes, specifically SW albedo, LW emissivity, and surface temperature. The uses and requirements for SRB data, a critical assessment of current capabilities for producing these data, and directions for future research are presented.
Query estimation techniques in database systems
The effectiveness of query optimization in database systems critically depends on the system's ability to assess the execution costs of different query execution plans. For this purpose, the sizes and data distributions of the intermediate results generated during plan execution need to be estimated as accurately as possible. This estimation requires the maintenance of statistics on the data stored in the database, which are referred to as data synopses.
While the problem of query cost estimation has received significant attention for over a decade, it has remained an open issue in practice, because most previous techniques have focused on singular aspects of the problem such as minimizing the estimation error of a single type of query and a single data distribution, whereas database management systems generally need to support a wide range of queries over a number of datasets.
In this thesis I introduce a new technique for query result estimation, which extends existing techniques in that it offers estimation for all combinations of the three major database operators: selection, projection, and join. The approach is based on separate and independent approximations of the attribute values contained in a dataset and their frequencies. Through the use of space-filling curves, the approach extends to multi-dimensional data, while maintaining its accuracy and computational properties. The resulting estimation accuracy is competitive with specialized techniques and superior to the histogram techniques currently implemented in commercial database management systems.
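The abstract does not say which space-filling curve is used. As one common example, the sketch below shows the Z-order (Morton) curve, which maps a two-dimensional point to a single integer by interleaving the bits of its coordinates, so that one-dimensional approximation techniques can be applied to the resulting keys. The function names and the 16-bit default are assumptions for illustration only.

```python
from typing import Tuple


def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x in even positions)."""
    limit = 1 << bits
    if not (0 <= x < limit and 0 <= y < limit):
        raise ValueError("coordinates must fit in the given number of bits")
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_decode(code: int, bits: int = 16) -> Tuple[int, int]:
    """Inverse of morton_encode: split interleaved bits back into (x, y)."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y
```

Sorting points by `morton_encode(x, y)` linearises the two-dimensional domain while keeping many spatially close points close in the one-dimensional order, which is the property a one-dimensional synopsis relies on.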
Because data synopses reside in main memory, they compete for available space with the database cache and query execution buffers. Consequently, the memory available to data synopses needs to be used efficiently. This results in a physical design problem for data synopses, which is to determine the best set of synopses for a given combination of datasets, queries, and available memory. This thesis introduces a formalization of the problem, and efficient algorithmic solutions.
All discussed techniques are evaluated with regard to their overhead and resulting estimation accuracy on a variety of synthetic and real-life datasets.
The effectiveness of query optimization in database systems depends critically on the system's ability to estimate the costs of the different ways of executing a query. For this purpose, the sizes and data distributions of the intermediate results generated during query execution must be estimated as accurately as possible. Solving this estimation problem requires statistics on the data stored in the database system; these statistics are also referred to as data synopses. Although the problem of estimating query costs has been studied intensively over the last 10 years, it is still considered open, because many of the proposed approaches address only one aspect of the problem. In most cases, techniques were studied for estimating a single operator on a single data distribution, whereas database systems in practice must support a variety of queries over diverse datasets. For this reason, this thesis presents a new approach to result estimation that goes beyond existing approaches in that it allows accurate estimation of arbitrary combinations of the three most important database operators: selection, projection, and join. My technique is based on separate and independent approximations of the distribution of a dataset's attribute values and the distribution of the frequencies of these attribute values. Through the use of space-filling curves, these approximation techniques can also be applied to multi-dimensional data distributions without losing their accuracy and low computational cost.
The resulting estimation accuracy is comparable to that of techniques specialized for a single operator, and significantly higher than that of the histogram-based approaches currently used in commercial database systems. Because data synopses reside in main memory, they reduce the memory available for the page cache or execution buffers. The memory reserved for synopses should therefore be used efficiently, or be as small as possible. This leads to the problem of determining the optimal combination of synopses for a given combination of data, queries, and available memory. This thesis presents a formal description of the problem as well as efficient algorithms for solving it. All described techniques are evaluated with respect to their overhead and the resulting estimation accuracy by means of experiments over a variety of data distributions.