221 research outputs found
Approximation algorithms for wavelet transform coding of data streams
This paper addresses the problem of finding a B-term wavelet representation
of a given discrete function f whose distance from f is minimized. The problem
is well understood when we seek to minimize the Euclidean distance between f
and its representation. The first known algorithms for finding provably
approximate representations minimizing general ℓp distances (including ℓ∞)
under a wide variety of compactly supported
wavelet bases are presented in this paper. For the Haar basis, a polynomial
time approximation scheme is demonstrated. These algorithms are applicable in
the one-pass sublinear-space data stream model of computation. They generalize
naturally to multiple dimensions and weighted norms. A universal representation
that provides a provable approximation guarantee under all ℓp-norms
simultaneously, and the first approximation algorithms for bit-budget versions
of the problem (known as adaptive quantization), are also presented. Further, it
is shown that the algorithms presented here can be used to select a basis from
a tree-structured dictionary of bases and find a B-term representation of the
given function that provably approximates its best dictionary-basis
representation.
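The special case the abstract calls "well understood" is instructive: for ℓ2 (Euclidean) error under the orthonormal Haar basis, the optimal B-term representation simply keeps the B largest-magnitude coefficients, by Parseval's identity. A minimal sketch of that special case (illustrative function names; this is not the paper's general ℓp algorithm):

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar transform of a length-2^k signal."""
    s, coeffs = np.asarray(x, float), []
    while len(s) > 1:
        coeffs.append((s[0::2] - s[1::2]) / np.sqrt(2))  # detail coefficients
        s = (s[0::2] + s[1::2]) / np.sqrt(2)             # running averages
    coeffs.append(s)                                     # overall average
    return np.concatenate(coeffs[::-1])

def inverse_haar(c):
    """Invert haar_transform, reconstructing the original signal."""
    s, i = c[:1], 1
    while i < len(c):
        d, new = c[i:2 * i], np.empty(2 * i)
        new[0::2], new[1::2] = (s + d) / np.sqrt(2), (s - d) / np.sqrt(2)
        s, i = new, 2 * i
    return s

def b_term(x, B):
    """Keep the B largest-magnitude Haar coefficients (l2-optimal)."""
    c = haar_transform(x)
    keep = np.argsort(-np.abs(c))[:B]
    sparse = np.zeros_like(c)
    sparse[keep] = c[keep]
    return inverse_haar(sparse)
```

For general ℓp error this greedy coefficient selection is no longer optimal, which is precisely the gap the paper's approximation algorithms address.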
Histograms and Wavelets on Probabilistic Data
There is a growing realization that uncertain information is a first-class
citizen in modern database management. As such, we need techniques to correctly
and efficiently process uncertain data in database systems. In particular, data
reduction techniques that can produce concise, accurate synopses of large
probabilistic relations are crucial. Similar to their deterministic relation
counterparts, such compact probabilistic data synopses can form the foundation
for human understanding and interactive data exploration, probabilistic query
planning and optimization, and fast approximate query processing in
probabilistic database systems.
In this paper, we introduce definitions and algorithms for building
histogram- and wavelet-based synopses on probabilistic data. The core problem
is to choose a set of histogram bucket boundaries or wavelet coefficients to
optimize the accuracy of the approximate representation of a collection of
probabilistic tuples under a given error metric. For a variety of different
error metrics, we devise efficient algorithms that construct optimal or near
optimal B-term histogram and wavelet synopses. This requires careful analysis
of the structure of the probability distributions, and novel extensions of
known dynamic-programming-based techniques for the deterministic domain. Our
experiments show that this approach clearly outperforms simple ideas, such as
building summaries for samples drawn from the data distribution, while taking
the same or less time.
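The deterministic building block being extended here is the classic dynamic program for an optimal B-bucket histogram minimizing sum-of-squared error. A minimal sketch of that deterministic ancestor (illustrative names; not the paper's probabilistic algorithm):

```python
import numpy as np

def optimal_histogram(data, B):
    """Optimal B-bucket histogram minimizing sum-of-squared error (SSE),
    via the standard O(n^2 B) dynamic program over prefix sums."""
    n = len(data)
    p = np.concatenate([[0.0], np.cumsum(data, dtype=float)])
    pp = np.concatenate([[0.0], np.cumsum(np.square(data, dtype=float))])

    def sse(i, j):
        # SSE of one bucket covering data[i:j], approximated by its mean
        s, ss, m = p[j] - p[i], pp[j] - pp[i], j - i
        return ss - s * s / m

    INF = float('inf')
    E = [[INF] * (B + 1) for _ in range(n + 1)]   # E[j][b]: best SSE, b buckets over prefix j
    cut = [[0] * (B + 1) for _ in range(n + 1)]
    E[0][0] = 0.0
    for j in range(1, n + 1):
        for b in range(1, min(B, j) + 1):
            for i in range(b - 1, j):             # last bucket is data[i:j]
                c = E[i][b - 1] + sse(i, j)
                if c < E[j][b]:
                    E[j][b], cut[j][b] = c, i
    bounds, j, b = [], n, B                        # recover bucket boundaries
    while b > 0:
        i = cut[j][b]
        bounds.append((i, j))
        j, b = i, b - 1
    return E[n][B], bounds[::-1]
```

In the probabilistic setting, each bucket's cost must be computed over the tuples' probability distributions rather than fixed values, which is where the paper's analysis comes in.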
Nonlinear Approximation and Image Representation using Wavelets
We address the problem of finding sparse wavelet representations of high-dimensional vectors. We present a lower-bounding technique and use it to develop an algorithm for computing provably-approximate instance-specific representations minimizing general ℓp distances under a wide variety of compactly-supported wavelet bases. More specifically, given a vector f, a compactly-supported wavelet basis, a sparsity constraint B, and an ℓp norm, our algorithm returns a B-term representation (a linear combination of B vectors from the given basis) whose ℓp distance from f is a provably bounded factor away from that of the optimal such representation of f. Our algorithm applies in the one-pass sublinear-space data streaming model of computation, and it generalizes to weighted ℓp-norms and multidimensional signals. Our technique also generalizes to a version of the problem where we are given a bit-budget rather than a term-budget. Furthermore, we use it to construct a universal representation that provides a provable approximation guarantee under all ℓp-norms simultaneously.
A Self-Adaptive Regression-Based Multivariate Data Compression Scheme with Error Bound in Wireless Sensor Networks
Wireless sensor networks (WSNs) have limited energy and transmission capacity, so data compression techniques have extensive applications. A sensor node with multiple sensing units is called a multimodal or multivariate node. For the multivariate streams on a sensor node, some data streams are selected as base functions according to the correlation coefficient matrix, and the other streams from the same node can be expressed in relation to one of these base functions using linear regression. By designing an incremental algorithm for computing regression coefficients, a multivariate data compression scheme based on self-adaptive regression with an infinity-norm (ℓ∞) error bound for WSNs is proposed. Based on the error bound and the compression gains, self-adaptation means that the proposed algorithms automatically decide whether to transmit raw data or regression coefficients, and how many data points to involve in the regression. The algorithms in the scheme can simultaneously exploit the temporal and multivariate correlations among the sensory data. It is concluded, both theoretically and experimentally, that the proposed algorithms can effectively exploit the correlations on the same sensor node and achieve significant reduction in data transmission. Furthermore, the algorithms perform consistently well even when the correlations among the multivariate stream data are less obvious or non-stationary.
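The decision rule at the heart of the scheme — send regression coefficients when they satisfy the error bound, raw data otherwise — can be sketched for a single window as follows (a simplified illustration with assumed names, not the paper's incremental algorithm):

```python
import numpy as np

def compress_window(base, target, eps):
    """Fit target ~= a*base + b over one window; transmit the two
    coefficients (a, b) only if the l-infinity (max absolute) residual
    stays within the error bound eps, otherwise fall back to raw data."""
    base = np.asarray(base, float)
    target = np.asarray(target, float)
    a, b = np.polyfit(base, target, 1)        # least-squares line fit
    resid = np.abs(target - (a * base + b))   # per-sample absolute error
    if resid.max() <= eps:
        return ('coeffs', (a, b))             # 2 values instead of len(target)
    return ('raw', target)
```

The payoff is the ratio between the window length and the two transmitted coefficients; the paper's incremental formulation avoids refitting from scratch as new samples arrive.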
Link-based similarity search to fight web spam
We investigate the usability of similarity search in fighting Web spam, based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. To be successful, search engine spam never appears in isolation: we observe link farms and alliances built for the sole purpose of search engine ranking manipulation. Their artificial nature and strong internal connectedness, however, have given rise to successful algorithms for identifying search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, which yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating the similarity top lists of an unknown page along various measures such as co-citation, Companion, nearest neighbors in low-dimensional projections, and SimRank. We test our method on two data sets previously used to measure spam filtering algorithms.
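The simplest of the similarity measures listed above, co-citation, can be sketched as a Jaccard overlap of in-link sets (an illustrative simplification; the paper evaluates several richer measures such as Companion and SimRank):

```python
def cocitation_similarity(inlinks, u, v):
    """Jaccard overlap of the sets of pages linking to u and to v.
    A high score for an unknown page u against a known spam page v
    suggests u sits in the same artificial link neighborhood."""
    a, b = inlinks.get(u, set()), inlinks.get(v, set())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A classifier along these lines would score an unknown page against the labeled blacklist and whitelist and compare the resulting similarity top lists.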