    Approximation algorithms for wavelet transform coding of data streams

    This paper addresses the problem of finding a B-term wavelet representation of a given discrete function f ∈ ℝ^n whose distance from f is minimized. The problem is well understood when we seek to minimize the Euclidean distance between f and its representation. This paper presents the first known algorithms for finding provably approximate representations minimizing general ℓ_p distances (including ℓ_∞) under a wide variety of compactly supported wavelet bases. For the Haar basis, a polynomial-time approximation scheme is demonstrated. These algorithms are applicable in the one-pass sublinear-space data-stream model of computation, and they generalize naturally to multiple dimensions and weighted norms. Also presented are a universal representation that provides a provable approximation guarantee under all p-norms simultaneously, and the first approximation algorithms for bit-budget versions of the problem, known as adaptive quantization. Further, it is shown that the algorithms presented here can be used to select a basis from a tree-structured dictionary of bases and find a B-term representation of the given function that provably approximates its best dictionary-basis representation.
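
    To ground the terminology, here is a minimal numpy sketch of the baseline this abstract starts from: the ℓ_2 (Euclidean) case, where keeping the B largest-magnitude coefficients of the orthonormal Haar expansion is optimal by Parseval's identity. The helper names (haar, ihaar, b_term_haar) are illustrative, not from the paper.

```python
import numpy as np

# Illustrative l2 baseline (not from the paper): the optimal B-term
# representation keeps the B largest-magnitude coefficients of an
# orthonormal wavelet expansion, by Parseval's identity.

def haar(f):
    """Orthonormal Haar transform of a length-2^k vector."""
    a = np.asarray(f, dtype=float).copy()
    n = len(a)
    while n > 1:
        half = n // 2
        avg = (a[0:n:2] + a[1:n:2]) / np.sqrt(2)   # scaling coefficients
        det = (a[0:n:2] - a[1:n:2]) / np.sqrt(2)   # detail coefficients
        a[:half], a[half:n] = avg, det
        n = half
    return a

def ihaar(c):
    """Inverse of haar()."""
    a = np.asarray(c, dtype=float).copy()
    n, N = 1, len(a)
    while n < N:
        half, n = n, 2 * n
        avg, det = a[:half].copy(), a[half:n].copy()
        a[0:n:2] = (avg + det) / np.sqrt(2)
        a[1:n:2] = (avg - det) / np.sqrt(2)
    return a

def b_term_haar(f, B):
    """Keep the B largest-magnitude Haar coefficients (l2-optimal)."""
    c = haar(f)
    sparse = np.zeros_like(c)
    keep = np.argsort(np.abs(c))[-B:]
    sparse[keep] = c[keep]
    return ihaar(sparse)

rng = np.random.default_rng(0)
f = rng.normal(size=16)
f_hat = b_term_haar(f, B=4)
print("l2 error:  ", np.linalg.norm(f - f_hat))    # minimized by this choice
print("linf error:", np.max(np.abs(f - f_hat)))    # not necessarily minimized
```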

    Approximation Algorithms for Wavelet Transform Coding of Data Streams

    Given a set of orthonormal basis functions {ψ_i} and a target function/vector f, the wavelet representation problem is to construct f̂ as a combination of at most B basis vectors so as to minimize some normed distance between f and f̂. The problem is well understood if the error is the mean squared error: retain the largest (ignoring signs) B coefficients of the wavelet expansion. This restriction to expansion coefficients follows from the proof of optimality and is not a built-in constraint. The mean squared error, however, is not the optimization criterion in several scenarios. The above easy solution does not carry over to ℓ_p for p ≠ 2, and it turns out that restricting the solution to any subset of expansion coefficients of size B or less is suboptimal compared to the best solution, which can choose arbitrary real numbers as coefficients. Further, all previous literature on non-ℓ_2 errors considered only the Haar system. In this paper we provide the first approximation schemes for the unrestricted optimization problem. We provide a lower-bounding technique based on a system of inequalities. We show that a modified greedy algorithm that retains coefficients of the expansion is a true O(log n)-factor approximation algorithm for a wide variety of compact wavelet systems, including Haar, Daubechies, Symmlets, and Coiflets, among others. This vindicates several scaling-type algorithms used in practice. We subsequently augment the lower bound and give an FPTAS for the Haar system. The same ideas extend to a QPTAS for the more general class of compact wavelets mentioned above. We also consider adaptive quantization problems, which are generalizations of the B-term representations.
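
    The claim that expansion coefficients can be suboptimal for p ≠ 2 is easy to verify numerically. The following self-contained check (a demonstration under illustrative names, not the paper's algorithm) compares, for B = 1 and the ℓ_∞ error on n = 4 points, the expansion coefficient ⟨f, ψ_i⟩ against the best arbitrary real coefficient found by grid search.

```python
import numpy as np

# Toy check (not the paper's algorithm): for l_inf error the expansion
# coefficient <f, psi_i> can be beaten by an arbitrary real coefficient,
# even with B = 1. The orthonormal Haar basis on n = 4 points:
s = 1 / np.sqrt(2)
basis = np.array([
    [0.5,  0.5,  0.5,  0.5],   # scaling vector
    [0.5,  0.5, -0.5, -0.5],   # coarse wavelet
    [s,   -s,    0.0,  0.0],   # fine wavelets
    [0.0,  0.0,  s,   -s  ],
])

def best_linf_coeff(f, psi):
    """Grid-search the x minimizing max_j |f_j - x * psi_j| (demo only)."""
    xs = np.linspace(-10, 10, 40001)
    errs = np.max(np.abs(f[None, :] - xs[:, None] * psi[None, :]), axis=1)
    k = int(np.argmin(errs))
    return xs[k], errs[k]

f = np.array([4.0, 1.0, 0.5, 0.0])
for psi in basis:
    c = f @ psi                                # expansion coefficient
    err_c = np.max(np.abs(f - c * psi))
    x, err_x = best_linf_coeff(f, psi)
    print(f"<f,psi> = {c:+.3f} (err {err_c:.3f})   best x = {x:+.3f} (err {err_x:.3f})")
```

    For the scaling vector, for instance, the expansion coefficient gives ℓ_∞ error 2.625, while the coefficient x = 4 achieves error 2.0.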

    Mining complex data in highly streaming environments

    Data is growing at a rapid rate because of advanced hardware and software technologies and platforms such as e-health systems, sensor networks, and social media. One of the challenging problems is storing, processing, and transferring this big data in an efficient and effective way. One solution is to construct synopses by means of data summarization techniques. Motivated by the fact that, without summarization, processing, analyzing, and communicating this vast amount of data is inefficient, this thesis introduces new summarization frameworks with the main goals of reducing communication costs and accelerating data mining processes in different application scenarios. Specifically, we study the following big data summarization techniques: (i) dimensionality reduction, (ii) clustering, and (iii) histograms, considering their importance and wide use in various areas and domains. We propose three frameworks using these techniques to cover three different aspects of big data, "Volume", "Velocity", and "Variety", in centralized and decentralized platforms: dimensionality reduction for summarizing large 2D arrays, and clustering and histograms for processing multiple data streams.

    Given the importance and rapid growth of emerging e-health applications such as tele-radiology and tele-medicine, which require fast, low-cost, and often lossless access to massive amounts of medical images and data over band-limited channels, our first framework summarizes streams of large medical images (e.g. X-rays) for the purpose of compression. Significant amounts of correlation and redundancy exist across different medical images; these can be extracted and used as a data summary to achieve better compression, and consequently lower storage and communication overheads on the network. We propose a novel memory-assisted compression framework, a learning-based universal coding scheme that can complement any existing algorithm to further eliminate redundancies and similarities across images. This approach is motivated by the fact that, in medical applications, massive amounts of correlated images from the same family are often available as training data for learning the dependencies and deriving appropriate reference or synopsis models, which can then be used for compressing any new image from the same family. In particular, dimensionality reduction techniques such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) are applied to a set of training images to form the required reference models. The proposed memory-assisted compression allows each image to be processed independently of other images, and hence allows individual image access and transmission.

    In the second part of our work, we investigate the problem of summarizing distributed multidimensional data streams using clustering. We devise a distributed clustering framework, DistClusTree, that extends the centralized ClusTree approach. The main difficulty in distributed clustering is balancing communication cost against clustering quality; DistClusTree tackles this by combining spatial index summaries with online tracking for efficient local and global incremental clustering. We demonstrate through extensive experiments the efficacy of the framework in terms of communication costs and approximate clustering quality.
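
    To make the first framework above concrete, here is a hedged sketch of memory-assisted coding with a PCA reference model, assuming flattened images and scikit-learn; the function names and shapes are illustrative, not the thesis's pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative memory-assisted coding with a PCA reference model: the model
# is learned once from a family of correlated training images; a new image
# is encoded as its projection coefficients plus a residual, so the round
# trip stays lossless while the residual is left for a standard compressor.

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 64 * 64))       # stand-in for flattened training images
model = PCA(n_components=32).fit(train)       # reference model, built offline

def encode(image, model):
    coeffs = model.transform(image.reshape(1, -1))              # compact summary
    residual = image.reshape(1, -1) - model.inverse_transform(coeffs)
    return coeffs, residual                                     # residual -> entropy coder

def decode(coeffs, residual, model):
    return (model.inverse_transform(coeffs) + residual).ravel()

image = rng.normal(size=64 * 64)              # stand-in for a new image
coeffs, residual = encode(image, model)
assert np.allclose(decode(coeffs, residual, model), image)      # lossless round trip
```
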
    In the last part, we use a multidimensional index structure to merge distributed summaries into a centralized histogram, another widely used summarization technique, with application to approximate range-query answering. We propose the index-based Distributed Mergeable Summaries (iDMS) framework, based on kd-trees, which addresses the attendant communication and accuracy challenges with data-generative models: Gaussian mixture models (GMMs) and a generative adversarial network (GAN). iDMS maintains a global approximate kd-tree at a central site via GMMs or GANs upon new arrivals of streaming data at local sites. Experimental results validate the effectiveness and efficiency of iDMS against baseline distributed settings in terms of approximation error and communication costs.
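
    A hedged sketch of the summary-merging idea: each site ships a small generative model (here a scikit-learn GMM) plus its point count, and the coordinator answers approximate range-count queries from the model alone. The Monte Carlo estimator and all names are assumptions for the sketch, not the iDMS implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative distributed summary: a site fits a small GMM to its window
# and ships only the model plus its point count; the coordinator estimates
# range counts from the model by Monte Carlo sampling.

def fit_local_summary(points, k=3):
    return GaussianMixture(n_components=k, random_state=0).fit(points)

def approx_range_count(gmm, n_points, lo, hi, n_samples=20000):
    """Estimate how many of the site's points fall in the box [lo, hi]."""
    samples, _ = gmm.sample(n_samples)
    inside = np.all((samples >= lo) & (samples <= hi), axis=1)
    return n_points * inside.mean()

rng = np.random.default_rng(2)
local = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(5000, 2))   # one site's window
gmm = fit_local_summary(local)

lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
est = approx_range_count(gmm, len(local), lo, hi)
true = int(np.sum(np.all((local >= lo) & (local <= hi), axis=1)))
print(f"estimated in-range count: {est:.0f}   exact: {true}")
```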

    Efficiently Processing Complex Queries in Sensor Networks
