121 research outputs found

    Approximation algorithms for wavelet transform coding of data streams

    This paper addresses the problem of finding a B-term wavelet representation of a given discrete function $f \in \mathbb{R}^n$ whose distance from f is minimized. The problem is well understood when we seek to minimize the Euclidean distance between f and its representation. The first known algorithms for finding provably approximate representations minimizing general $\ell_p$ distances (including $\ell_\infty$) under a wide variety of compactly supported wavelet bases are presented in this paper. For the Haar basis, a polynomial time approximation scheme is demonstrated. These algorithms are applicable in the one-pass sublinear-space data stream model of computation. They generalize naturally to multiple dimensions and weighted norms. Also presented are a universal representation that provides a provable approximation guarantee under all p-norms simultaneously, and the first approximation algorithms for bit-budget versions of the problem, known as adaptive quantization. Further, it is shown that the algorithms presented here can be used to select a basis from a tree-structured dictionary of bases and find a B-term representation of the given function that provably approximates its best dictionary-basis representation.
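    The abstract notes that the Euclidean case is well understood. As a minimal sketch of that baseline only (not the paper's $\ell_p$ or streaming algorithms), the following assumes an orthonormal Haar basis and an input whose length is a power of two; under those assumptions, keeping the B coefficients of largest magnitude minimizes the Euclidean error. The function names `haar_transform`, `inverse_haar_transform`, `best_b_term_l2`, and `l2_error` are hypothetical and chosen for illustration.

```python
import math


def haar_transform(values):
    """Orthonormal Haar transform; assumes len(values) is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        # Scaled pairwise sums (averages) and differences (details).
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:length] = averages + details
        length = half
    return coeffs


def inverse_haar_transform(coeffs):
    """Inverse of haar_transform, rebuilding the signal level by level."""
    values = list(coeffs)
    length = 1
    while length < len(values):
        merged = []
        for a, d in zip(values[:length], values[length:2 * length]):
            merged.append((a + d) / math.sqrt(2))
            merged.append((a - d) / math.sqrt(2))
        values[:2 * length] = merged
        length *= 2
    return values


def best_b_term_l2(values, b):
    """Keep the b largest-magnitude Haar coefficients and reconstruct."""
    coeffs = haar_transform(values)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:b])
    sparse = [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
    return inverse_haar_transform(sparse)


def l2_error(original, approximation):
    """Euclidean distance between a signal and its approximation."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(original, approximation)))
```

    Because the basis is orthonormal, the squared Euclidean error equals the sum of the squares of the dropped coefficients, which is why the greedy choice is optimal here. That argument does not carry over to other $\ell_p$ norms, which is the gap the paper addresses.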

    Global Slope Change Synopses for Measurement Maps

    Quality control using scalar quality measures is standard practice in manufacturing. However, there are also quality measures that are determined at a large number of positions on a product, since the spatial distribution is important. We denote such a mapping of local coordinates on the product to values of a measure as a measurement map. In this paper, we examine how measurement maps can be clustered according to a novel notion of similarity - mapscape similarity - that considers the overall course of the measure on the map. We present a class of synopses called global slope change that uses the profile of the measure along several lines from a reference point to different points on the borders to represent a measurement map. We conduct an evaluation of global slope change using a real-world data set from manufacturing and demonstrate its superiority over other synopses

    Histograms and Wavelets on Probabilistic Data

    There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial. Similar to their deterministic relation counterparts, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of different error metrics, we devise efficient algorithms that construct optimal or near optimal B-term histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data distribution, while taking equal or less time

    Constructing fading histograms from data streams

    The ability to collect data is changing drastically. Nowadays, data are gathered in the form of transient and finite data streams. Memory restrictions preclude keeping all received data in memory. When dealing with massive data streams, it is mandatory to create compact representations of data, also known as synopsis structures or summaries. Reducing memory occupancy is of utmost importance when handling a huge amount of data. This paper addresses the problem of constructing histograms from data streams under error constraints. When constructing online histograms from data streams, there are two main characteristics to consider: the ease of updating and the error of the histogram. Moreover, in dynamic environments, besides the need for compact summaries that capture the most important properties of the data, it is also essential to forget old data. Therefore, this paper presents sliding histograms and fading histograms, which forget outdated data abruptly and smoothly, respectively.
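    The two forgetting strategies can be illustrated with a minimal sketch. It assumes fixed equal-width bins over a known value range, a fixed-size window for the abrupt variant, and an exponential fading factor `alpha` for the smooth variant; the class names, parameters, and binning rule are hypothetical and are not taken from the paper.

```python
from collections import deque


def bin_index(x, low, high, bins):
    """Map x to an equal-width bin, clamping values outside [low, high]."""
    width = (high - low) / bins
    idx = int((x - low) / width)
    return min(max(idx, 0), bins - 1)


class SlidingHistogram:
    """Abrupt forgetting: only the last `window` observations are counted."""

    def __init__(self, low, high, bins, window):
        self.low, self.high, self.bins = low, high, bins
        self.window = window
        self.recent = deque()          # bin indices of the observations in the window
        self.counts = [0] * bins

    def update(self, x):
        idx = bin_index(x, self.low, self.high, self.bins)
        self.recent.append(idx)
        self.counts[idx] += 1
        if len(self.recent) > self.window:
            # The oldest observation leaves the window all at once.
            self.counts[self.recent.popleft()] -= 1


class FadingHistogram:
    """Smooth forgetting: every count decays by `alpha` on each new observation."""

    def __init__(self, low, high, bins, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.low, self.high, self.bins = low, high, bins
        self.alpha = alpha
        self.counts = [0.0] * bins

    def update(self, x):
        # Older observations lose weight gradually instead of being dropped.
        self.counts = [c * self.alpha for c in self.counts]
        self.counts[bin_index(x, self.low, self.high, self.bins)] += 1.0
```

    In this sketch the sliding variant must remember the bin of each observation in the window, while the fading variant stores only one weight per bin, which is one reason a fading scheme can be attractive under tight memory limits.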

    The Use of the Discrete Wavelet Transform to Perform High Level Data Compression for Applications in Telemedicine

    The document includes an executive summary of the program activities; questions regarding tiling that have yet to be addressed; and the impact of the grants received which include MDACC Infrastructure development, support of technology transfer, and the technical accomplishments of the program

    How to evaluate multiple range-sum queries progressively

    Decision support system users typically submit batches of range-sum queries simultaneously rather than issuing individual, unrelated queries. We propose a wavelet based technique that exploits I/O sharing across a query batch to evaluate the set of queries progressively and efficiently. The challenge is that now controlling the structure of errors across query results becomes more critical than minimizing error per individual query. Consequently, we define a class of structural error penalty functions and show how they are controlled by our technique. Experiments demonstrate that our technique is efficient as an exact algorithm, and the progressive estimates are accurate, even after less than one I/O per query

    Query estimation techniques in database systems

    The effectiveness of query optimization in database systems critically depends on the system's ability to assess the execution costs of different query execution plans. For this purpose, the sizes and data distributions of the intermediate results generated during plan execution need to be estimated as accurately as possible. This estimation requires the maintenance of statistics on the data stored in the database, which are referred to as data synopses. While the problem of query cost estimation has received significant attention for over a decade, it has remained an open issue in practice, because most previous techniques have focused on singular aspects of the problem such as minimizing the estimation error of a single type of query and a single data distribution, whereas database management systems generally need to support a wide range of queries over a number of datasets. In this thesis I introduce a new technique for query result estimation, which extends existing techniques in that it offers estimation for all combinations of the three major database operators: selection, projection, and join. The approach is based on separate and independent approximations of the attribute values contained in a dataset and their frequencies. Through the use of space-filling curves, the approach extends to multi-dimensional data, while maintaining its accuracy and computational properties. The resulting estimation accuracy is competitive with specialized techniques and superior to the histogram techniques currently implemented in commercial database management systems. Because data synopses reside in main memory, they compete for available space with the database cache and query execution buffers. Consequently, the memory available to data synopses needs to be used efficiently. This results in a physical design problem for data synopses, which is to determine the best set of synopses for a given combination of datasets, queries, and available memory. This thesis introduces a formalization of the problem and efficient algorithmic solutions. All discussed techniques are evaluated with regard to their overhead and resulting estimation accuracy on a variety of synthetic and real-life datasets.

    Efficiently Processing Complex Queries in Sensor Networks
