Search CORE

8,594 research outputs found

Estimating Entropy of Data Streams Using Compressed Counting

Author: Li Ping
Publication venue
Publication date: 01/01/2009
Field of study

The Shannon entropy is a widely used summary statistic, for example, network traffic measurement, anomaly detection, neural computations, spike trains, etc. This study focuses on estimating Shannon entropy of data streams. It is known that Shannon entropy can be approximated by Reenyi entropy or Tsallis entropy, which are both functions of the p-th frequency moments and approach Shannon entropy as p->1. Compressed Counting (CC) is a new method for approximating the p-th frequency moments of data streams. Our contributions include: 1) We prove that Renyi entropy is (much) better than Tsallis entropy for approximating Shannon entropy. 2) We propose the optimal quantile estimator for CC, which considerably improves the previous estimators. 3) Our experiments demonstrate that CC is indeed highly effective approximating the moments and entropies. We also demonstrate the crucial importance of utilizing the variance-bias trade-off

arXiv.org e-Print Archive

CiteSeerX

On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting

Author: Li Ping
Publication venue
Publication date: 01/01/2010
Field of study

Estimating the p-th frequency moment of data stream is a very heavily studied problem. The problem is actually trivial when p = 1, assuming the strict Turnstile model. The sample complexity of our proposed algorithm is essentially O(1) near p=1. This is a very large improvement over the previously believed O(1/eps^2) bound. The proposed algorithm makes the long-standing problem of entropy estimation an easy task, as verified by the experiments included in the appendix

arXiv.org e-Print Archive

CiteSeerX

Random projections for Bayesian regression

Author: Geppert Leo N.
Ickstadt Katja
Munteanu Alexander
Quedenfeld Jens
Sohler Christian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/11/2015
Field of study

This article deals with random projections applied as a data reduction technique for Bayesian regression analysis. We show sufficient conditions under which the entire

d

-dimensional distribution is approximately preserved under random projections by reducing the number of data points from

n

k\in O(\operatorname{poly}(d/\varepsilon))

in the case

n\gg d

. Under mild assumptions, we prove that evaluating a Gaussian likelihood function based on the projected data instead of the original data yields a

(1+O(\varepsilon))

-approximation in terms of the

\ell_2

Wasserstein distance. Our main result shows that the posterior distribution of Bayesian linear regression is approximated up to a small error depending on only an

\varepsilon

-fraction of its defining parameters. This holds when using arbitrary Gaussian priors or the degenerate case of uniform distributions over

\mathbb{R}^d

for

\beta

. Our empirical evaluations involve different simulated settings of Bayesian linear regression. Our experiments underline that the proposed method is able to recover the regression model up to small error while considerably reducing the total running time

arXiv.org e-Print Archive

Springer - Publisher Connector

Sign Stable Projections, Sign Cauchy Projections and Chi-Square Kernels

Author: Hopcroft John
Li Ping
Samorodnitsky Gennady
Publication venue
Publication date: 05/08/2013
Field of study

The method of stable random projections is popular for efficiently computing the Lp distances in high dimension (where 0<p<=2), using small space. Because it adopts nonadaptive linear projections, this method is naturally suitable when the data are collected in a dynamic streaming fashion (i.e., turnstile data streams). In this paper, we propose to use only the signs of the projected data and analyze the probability of collision (i.e., when the two signs differ). We derive a bound of the collision probability which is exact when p=2 and becomes less sharp when p moves away from 2. Interestingly, when p=1 (i.e., Cauchy random projections), we show that the probability of collision can be accurately approximated as functions of the chi-square similarity. For example, when the (un-normalized) data are binary, the maximum approximation error of the collision probability is smaller than 0.0192. In text and vision applications, the chi-square similarity is a popular measure for nonnegative data when the features are generated from histograms. Our experiments confirm that the proposed method is promising for large-scale learning applications

arXiv.org e-Print Archive

CiteSeerX

Recursive Sketching For Frequency Moments

Author: Braverman Vladimir
Ostrovsky Rafail
Publication venue
Publication date: 11/11/2010
Field of study

In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to compute

F_k

(for

k>2

) in space complexity O(\mbox{\em poly-log}(n,m)\cdot n^{1-\frac2k}), which is optimal up to (large) poly-logarithmic factors in

n

and

m

, where

m

is the length of the stream and

n

is the upper bound on the number of distinct elements in a stream. The best known lower bound for large moments is

\Omega(\log(n)n^{1-\frac2k})

. A follow-up work of Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic factors of Indyk and Woodruff to

O(\log^2(m)\cdot (\log n+ \log m)\cdot n^{1-{2\over k}})

. Further reduction of poly-log factors has been an elusive goal since 2006, when Indyk and Woodruff method seemed to hit a natural "barrier." Using our simple recursive sketch, we provide a different yet simple approach to obtain a

O(\log(m)\log(nm)\cdot (\log\log n)^4\cdot n^{1-{2\over k}})

algorithm for constant

\epsilon

(our bound is, in fact, somewhat stronger, where the

(\log\log n)

term can be replaced by any constant number of

\log

iterations instead of just two or three, thus approaching

log^*n

. Our bound also works for non-constant

\epsilon

(for details see the body of the paper). Further, our algorithm requires only

4

-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments

arXiv.org e-Print Archive

CiteSeerX

Max-stable sketches: estimation of Lp-norms, dominance norms and point queries for non-negative signals

Author: Stoev Stilian A.
Taqqu Murad S.
Publication venue
Publication date: 01/01/2010
Field of study

Max-stable random sketches can be computed efficiently on fast streaming positive data sets by using only sequential access to the data. They can be used to answer point and Lp-norm queries for the signal. There is an intriguing connection between the so-called p-stable (or sum-stable) and the max-stable sketches. Rigorous performance guarantees through error-probability estimates are derived and the algorithmic implementation is discussed

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

An Outer Gap Model of High-Energy Emission from Rotation-Powered Pulsars

Author: Chiang J.
Romani R. W.
Publication venue: 'University of Chicago Press'
Publication date: 18/01/1994
Field of study

We describe a refined calculation of high energy emission from rotation-powered pulsars based on the Outer Gap model of Cheng, Ho \&~Ruderman (1986a,b). We have improved upon previous efforts to model the spectra from these pulsars (e. g. Cheng, et al. 1986b; Ho 1989) by following the variation in particle production and radiation properties with position in the outer gap. Curvature, synchrotron and inverse-Compton scattering fluxes vary significantly over the gap and their interactions {\it via} photon-photon pair production build up the radiating charge populations at varying rates. We have also incorporated an approximate treatment of the transport of particle and photon fluxes between gap emission zones. These effects, along with improved computations of the particle and photon distributions, provide very important modifications of the model gamma-ray flux. In particular, we attempt to make specific predictions of pulse profile shapes and spectral variations as a function of pulse phase and suggest further extensions to the model which may provide accurate computations of the observed high energy emissions.Comment: 13 pages, LaTeX, for figures send request to [email protected]

arXiv.org e-Print Archive

Crossref

CERN Document Server