30 research outputs found

Approximation algorithms for wavelet transform coding of data streams

Author: Guha Sudipto
Harb Boulos
Publication date: 01/01/2006

This paper addresses the problem of finding a B-term wavelet representation of a given discrete function $f \in \mathbb{R}^n$ whose distance from $f$ is minimized. The problem is well understood when we seek to minimize the Euclidean distance between $f$ and its representation. The first known algorithms for finding provably approximate representations minimizing general $\ell_p$ distances (including $\ell_\infty$) under a wide variety of compactly supported wavelet bases are presented in this paper. For the Haar basis, a polynomial time approximation scheme is demonstrated. These algorithms are applicable in the one-pass sublinear-space data stream model of computation. They generalize naturally to multiple dimensions and weighted norms. A universal representation that provides a provable approximation guarantee under all p-norms simultaneously, and the first approximation algorithms for bit-budget versions of the problem, known as adaptive quantization, are also presented. Further, it is shown that the algorithms presented here can be used to select a basis from a tree-structured dictionary of bases and find a B-term representation of the given function that provably approximates its best dictionary-basis representation.

Comment: Added a universal representation that provides a provable approximation guarantee under all p-norms simultaneously
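As a minimal illustration of the Euclidean case that the abstract above calls well understood: for an orthonormal basis such as Haar, keeping the B largest-magnitude coefficients gives the best B-term representation in the Euclidean norm (by Parseval's identity). The sketch below is not the paper's streaming algorithm and does not handle general $\ell_p$ distances; the function names are hypothetical and the input length is assumed to be a power of two.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_transform(values):
    """Orthonormal Haar transform; len(values) must be a power of two."""
    coeffs = [float(v) for v in values]
    n = len(coeffs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (averages) and differences (details).
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        coeffs[:length] = avg + diff
        length = half
    return coeffs

def inverse_haar_transform(coeffs):
    """Inverse of haar_transform."""
    out = list(coeffs)
    n = len(out)
    length = 1
    while length < n:
        merged = []
        for a, d in zip(out[:length], out[length:2 * length]):
            merged.append((a + d) / SQRT2)
            merged.append((a - d) / SQRT2)
        out[:2 * length] = merged
        length *= 2
    return out

def best_b_term_l2(values, b):
    """Keep the b largest-magnitude Haar coefficients and reconstruct."""
    coeffs = haar_transform(values)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = [0.0] * len(coeffs)
    for i in order[:b]:
        kept[i] = coeffs[i]
    return inverse_haar_transform(kept)
```

For example, `best_b_term_l2([4, 4, 4, 4], 1)` reconstructs the constant signal from a single coefficient, up to floating-point rounding. Minimizing other $\ell_p$ errors requires different techniques, which is the subject of the paper.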

arXiv.org e-Print Archive

CiteSeerX

ScholarlyCommons@Penn

Histograms and Wavelets on Probabilistic Data

Author: Cormode Graham
Garofalakis Minos
Publication date: 01/01/2008

There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial. Similar to their deterministic relation counterparts, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of different error metrics, we devise efficient algorithms that construct optimal or near optimal B-term histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data distribution, while taking equal or less time

arXiv.org e-Print Archive

CiteSeerX

Crossref

Warwick Research Archives Portal Repository

Nonlinear Approximation and Image Representation using Wavelets

Author: Guha Sudipto
Harb Boulos
Publication venue: Dagstuhl Seminar Proceedings. 07071 - Web Information Retrieval and Linear Algebra Algorithms
Publication date: 01/01/2007

We address the problem of finding sparse wavelet representations of high-dimensional vectors. We present a lower-bounding technique and use it to develop an algorithm for computing provably-approximate instance-specific representations minimizing general $\ell_p$ distances under a wide variety of compactly-supported wavelet bases. More specifically, given a vector $f \in \mathbb{R}^n$, a compactly-supported wavelet basis, a sparsity constraint $B \in \mathbb{Z}$, and $p \in [1,\infty]$, our algorithm returns a $B$-term representation (a linear combination of $B$ vectors from the given basis) whose $\ell_p$ distance from $f$ is an $O(\log n)$ factor away from that of the optimal such representation of $f$. Our algorithm applies in the one-pass sublinear-space data streaming model of computation, and it generalizes to weighted $p$-norms and multidimensional signals. Our technique also generalizes to a version of the problem where we are given a bit-budget rather than a term-budget. Furthermore, we use it to construct a universal representation that consists of at most $B(\log n)^2$ terms and gives an $O(\log n)$-approximation under all $p$-norms simultaneously.

Dagstuhl Research Online Publication Server

How to evaluate multiple range-sum queries progressively

Author: Cyrus Shahabi
Rolfe R. Schmidt
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2002

Decision support system users typically submit batches of range-sum queries simultaneously rather than issuing individual, unrelated queries. We propose a wavelet-based technique that exploits I/O sharing across a query batch to evaluate the set of queries progressively and efficiently. The challenge is that controlling the structure of errors across query results now becomes more critical than minimizing error per individual query. Consequently, we define a class of structural error penalty functions and show how they are controlled by our technique. Experiments demonstrate that our technique is efficient as an exact algorithm, and that the progressive estimates are accurate, even after less than one I/O per query.

CiteSeerX

Crossref

A Self-Adaptive Regression-Based Multivariate Data Compression Scheme with Error Bound in Wireless Sensor Networks

Author: Bai S.
Borgne Y. L.
Chen H.
Garofalakis M.
Kabara J.
Kazemeyni F.
Kimura N.
Tulone D.
Zhu T. J.
Publication venue: Hindawi Limited
Publication date: 01/01/2013

Wireless sensor networks (WSNs) have limited energy and transmission capacity, so data compression techniques have extensive applications. A sensor node with multiple sensing units is called a multimodal or multivariate node. For a multivariate stream on a sensor node, some data streams are elected as the base functions according to the correlation coefficient matrix, and the other streams from the same node can be expressed in relation to one of these base functions using linear regression. By designing an incremental algorithm for computing regression coefficients, a multivariate data compression scheme based on self-adaptive regression with an infinity-norm error bound for WSNs is proposed. Self-adaption means that, according to error bounds and compression gains, the proposed algorithms decide automatically whether to transmit raw data or regression coefficients, and select the number of data points involved in the regression. The algorithms in the scheme can simultaneously exploit the temporal and multivariate correlations among the sensory data. Theoretically and experimentally, it is concluded that the proposed algorithms can effectively exploit the correlations on the same sensor node and achieve a significant reduction in data transmission. Furthermore, the algorithms perform consistently well even when multivariate stream data correlations are less obvious or non-stationary.
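The idea of expressing one stream through a base stream by incrementally updated linear regression, then checking an infinity-norm error bound, can be sketched as follows. This is only an illustrative sketch under simple assumptions (one base stream, ordinary least squares maintained through running sums), not the scheme from the paper; the class and function names are hypothetical.

```python
class IncrementalRegression:
    """Least-squares fit of y ~ intercept + slope * x, updated one sample at a time."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def add(self, x, y):
        # Only running sums are stored, so each update costs O(1) time and memory.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two samples with distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


def within_error_bound(base, other, intercept, slope, epsilon):
    """True if the maximum absolute residual (infinity norm) is at most epsilon."""
    worst = max(abs(y - (intercept + slope * x)) for x, y in zip(base, other))
    return worst <= epsilon


# Usage: decide between sending two coefficients or the raw samples.
base = [1.0, 2.0, 3.0, 4.0]
other = [3.0, 5.0, 7.0, 9.0]
model = IncrementalRegression()
for x, y in zip(base, other):
    model.add(x, y)
intercept, slope = model.coefficients()
send_coefficients = within_error_bound(base, other, intercept, slope, epsilon=0.1)
```

When the bound holds, a node in this sketch would transmit the two coefficients instead of the raw samples of the dependent stream; otherwise it would fall back to raw data.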

University of Essex Research Repository

Crossref

Directory of Open Access Journals

Mining complex data in highly streaming environments

Author: Razavi Hesabi Z
Publication venue: RMIT University

Data is growing at a rapid rate because of advanced hardware and software technologies and platforms such as e-health systems, sensor networks, and social media. One of the challenging problems is storing, processing and transferring this big data in an efficient and effective way. One solution to tackle these challenges is to construct synopses by means of data summarization techniques. Motivated by the fact that without summarization, processing, analyzing and communicating this vast amount of data is inefficient, this thesis introduces new summarization frameworks with the main goals of reducing communication costs and accelerating data mining processes in different application scenarios. Specifically, we study the following big data summarization techniques: (i) dimensionality reduction; (ii) clustering; and (iii) histograms, considering their importance and wide use in various areas and domains. In our work, we propose three different frameworks using these summarization techniques to cover three different aspects of big data: "Volume", "Velocity" and "Variety" in centralized and decentralized platforms. We use dimensionality reduction techniques for summarizing large 2D-arrays, and clustering and histograms for processing multiple data streams. With respect to the importance and rapid growth of emerging e-health applications such as tele-radiology and tele-medicine that require fast, low-cost, and often lossless access to massive amounts of medical images and data over band-limited channels, our first framework attempts to summarize streams of large-volume medical images (e.g. X-rays) for the purpose of compression. Significant amounts of correlation and redundancy exist across different medical images. These can be extracted and used as a data summary to achieve better compression, and consequently less storage and less communication overhead on the network.
We propose a novel memory-assisted compression framework as a learning-based universal coding, which can be used to complement any existing algorithm to further eliminate redundancies/similarities across images. This approach is motivated by the fact that, often in medical applications, massive amounts of correlated images from the same family are available as training data for learning the dependencies and deriving appropriate reference or synopsis models. The models can then be used for compression of any new image from the same family. In particular, dimensionality reduction techniques such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) are applied on a set of images from training data to form the required reference models. The proposed memory-assisted compression allows each image to be processed independently of other images, and hence allows individual image access and transmission. In the second part of our work, we investigate the problem of summarizing distributed multidimensional data streams using clustering. We devise a distributed clustering framework, DistClusTree, that extends the centralized ClusTree approach. The main difficulty in distributed clustering is balancing communication costs and clustering quality. We tackle this in DistClusTree through combining spatial index summaries and online tracking for efficient local and global incremental clustering. We demonstrate through extensive experiments the efficacy of the framework in terms of communication costs and approximate clustering quality. In the last part, we use a multidimensional index structure to merge distributed summaries in the form of a centralized histogram as another widely used summarization technique with the application in approximate range query answering.
In this thesis, we propose the index-based Distributed Mergeable Summaries (iDMS) framework based on kd-trees that addresses these challenges with data generative models of Gaussian mixture models (GMMs) and a Generative Adversarial Network (GAN). iDMS maintains a global approximate kd-tree at a central site via GMMs or GANs upon new arrivals of streaming data at local sites. Experimental results validate the effectiveness and efficiency of iDMS against baseline distributed settings in terms of approximation error and communication costs

RMIT Research Repository

Behaviour Profiling using Wearable Sensors for Pervasive Healthcare

Author: Ali Syed Muhammad Raza
Publication venue: Computing, Imperial College London
Publication date: 01/02/2013

In recent years, sensor technology has advanced in terms of hardware sophistication and miniaturisation. This has led to the incorporation of unobtrusive, low-power sensors into networks centred on human participants, called Body Sensor Networks. Amongst the most important applications of these networks is their use in healthcare and healthy living. The technology has the possibility of decreasing burden on the healthcare systems by providing care at home, enabling early detection of symptoms, monitoring recovery remotely, and avoiding serious chronic illnesses by promoting healthy living through objective feedback. In this thesis, machine learning and data mining techniques are developed to estimate medically relevant parameters from a participant's activity and behaviour parameters, derived from simple, body-worn sensors. The first abstraction from raw sensor data is the recognition and analysis of activity. Machine learning analysis is applied to a study of activity profiling to detect impaired limb and torso mobility. One of the advances in this thesis to activity recognition research is in the application of machine learning to the analysis of 'transitional activities': transient activity that occurs as people change their activity. A framework is proposed for the detection and analysis of transitional activities. To demonstrate the utility of transition analysis, we apply the algorithms to a study of participants undergoing and recovering from surgery. We demonstrate that it is possible to see meaningful changes in the transitional activity as the participants recover. Assuming long-term monitoring, we expect a large historical database of activity to quickly accumulate. We develop algorithms to mine temporal associations to activity patterns. This gives an outline of the user's routine. Methods for visual and quantitative analysis of routine using this summary data structure are proposed and validated.
The activity and routine mining methodologies developed for specialised sensors are adapted to a smartphone application, enabling large-scale use. Validation of the algorithms is performed using datasets collected in laboratory settings, and free living scenarios. Finally, future research directions and potential improvements to the techniques developed in this thesis are outlined

Spiral - Imperial College Digital Repository