8,594 research outputs found
Estimating Entropy of Data Streams Using Compressed Counting
The Shannon entropy is a widely used summary statistic, for example, network
traffic measurement, anomaly detection, neural computations, spike trains, etc.
This study focuses on estimating Shannon entropy of data streams. It is known
that Shannon entropy can be approximated by Reenyi entropy or Tsallis entropy,
which are both functions of the p-th frequency moments and approach Shannon
entropy as p->1.
Compressed Counting (CC) is a new method for approximating the p-th frequency
moments of data streams. Our contributions include:
1) We prove that Renyi entropy is (much) better than Tsallis entropy for
approximating Shannon entropy.
2) We propose the optimal quantile estimator for CC, which considerably
improves the previous estimators.
3) Our experiments demonstrate that CC is indeed highly effective
approximating the moments and entropies. We also demonstrate the crucial
importance of utilizing the variance-bias trade-off
On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting
Estimating the p-th frequency moment of data stream is a very heavily studied
problem. The problem is actually trivial when p = 1, assuming the strict
Turnstile model. The sample complexity of our proposed algorithm is essentially
O(1) near p=1. This is a very large improvement over the previously believed
O(1/eps^2) bound. The proposed algorithm makes the long-standing problem of
entropy estimation an easy task, as verified by the experiments included in the
appendix
Random projections for Bayesian regression
This article deals with random projections applied as a data reduction
technique for Bayesian regression analysis. We show sufficient conditions under
which the entire -dimensional distribution is approximately preserved under
random projections by reducing the number of data points from to in the case . Under mild
assumptions, we prove that evaluating a Gaussian likelihood function based on
the projected data instead of the original data yields a
-approximation in terms of the Wasserstein
distance. Our main result shows that the posterior distribution of Bayesian
linear regression is approximated up to a small error depending on only an
-fraction of its defining parameters. This holds when using
arbitrary Gaussian priors or the degenerate case of uniform distributions over
for . Our empirical evaluations involve different
simulated settings of Bayesian linear regression. Our experiments underline
that the proposed method is able to recover the regression model up to small
error while considerably reducing the total running time
Sign Stable Projections, Sign Cauchy Projections and Chi-Square Kernels
The method of stable random projections is popular for efficiently computing
the Lp distances in high dimension (where 0<p<=2), using small space. Because
it adopts nonadaptive linear projections, this method is naturally suitable
when the data are collected in a dynamic streaming fashion (i.e., turnstile
data streams). In this paper, we propose to use only the signs of the projected
data and analyze the probability of collision (i.e., when the two signs
differ). We derive a bound of the collision probability which is exact when p=2
and becomes less sharp when p moves away from 2. Interestingly, when p=1 (i.e.,
Cauchy random projections), we show that the probability of collision can be
accurately approximated as functions of the chi-square similarity. For example,
when the (un-normalized) data are binary, the maximum approximation error of
the collision probability is smaller than 0.0192. In text and vision
applications, the chi-square similarity is a popular measure for nonnegative
data when the features are generated from histograms. Our experiments confirm
that the proposed method is promising for large-scale learning applications
Recursive Sketching For Frequency Moments
In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to
compute (for ) in space complexity O(\mbox{\em poly-log}(n,m)\cdot
n^{1-\frac2k}), which is optimal up to (large) poly-logarithmic factors in
and , where is the length of the stream and is the upper bound on
the number of distinct elements in a stream. The best known lower bound for
large moments is . A follow-up work of
Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic
factors of Indyk and Woodruff to . Further reduction of poly-log factors has been an elusive
goal since 2006, when Indyk and Woodruff method seemed to hit a natural
"barrier." Using our simple recursive sketch, we provide a different yet simple
approach to obtain a algorithm for constant (our bound is, in fact, somewhat
stronger, where the term can be replaced by any constant number
of iterations instead of just two or three, thus approaching .
Our bound also works for non-constant (for details see the body of
the paper). Further, our algorithm requires only -wise independence, in
contrast to existing methods that use pseudo-random generators for computing
large frequency moments
Max-stable sketches: estimation of Lp-norms, dominance norms and point queries for non-negative signals
Max-stable random sketches can be computed efficiently on fast streaming
positive data sets by using only sequential access to the data. They can be
used to answer point and Lp-norm queries for the signal. There is an intriguing
connection between the so-called p-stable (or sum-stable) and the max-stable
sketches. Rigorous performance guarantees through error-probability estimates
are derived and the algorithmic implementation is discussed
An Outer Gap Model of High-Energy Emission from Rotation-Powered Pulsars
We describe a refined calculation of high energy emission from
rotation-powered pulsars based on the Outer Gap model of Cheng, Ho \&~Ruderman
(1986a,b). We have improved upon previous efforts to model the spectra from
these pulsars (e. g. Cheng, et al. 1986b; Ho 1989) by following the variation
in particle production and radiation properties with position in the outer gap.
Curvature, synchrotron and inverse-Compton scattering fluxes vary significantly
over the gap and their interactions {\it via} photon-photon pair production
build up the radiating charge populations at varying rates. We have also
incorporated an approximate treatment of the transport of particle and photon
fluxes between gap emission zones. These effects, along with improved
computations of the particle and photon distributions, provide very important
modifications of the model gamma-ray flux. In particular, we attempt to make
specific predictions of pulse profile shapes and spectral variations as a
function of pulse phase and suggest further extensions to the model which may
provide accurate computations of the observed high energy emissions.Comment: 13 pages, LaTeX, for figures send request to [email protected]
- …