Estimating Entropy of Data Streams Using Compressed Counting
Shannon entropy is a widely used summary statistic, with applications in network
traffic measurement, anomaly detection, neural computations, and spike trains.
This study focuses on estimating the Shannon entropy of data streams. It is known
that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy,
which are both functions of the p-th frequency moments and approach Shannon
entropy as p → 1.
Compressed Counting (CC) is a new method for approximating the p-th frequency
moments of data streams. Our contributions include:
1) We prove that Rényi entropy is (much) better than Tsallis entropy for
approximating Shannon entropy.
2) We propose the optimal quantile estimator for CC, which considerably
improves the previous estimators.
3) Our experiments demonstrate that CC is indeed highly effective in
approximating the moments and entropies. We also demonstrate the crucial
importance of utilizing the variance-bias trade-off.
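The relationship between the frequency moments and the three entropies can be illustrated with a minimal Python sketch. It uses exact moments computed from a small, made-up count vector rather than Compressed Counting itself, and the function names and the standard Rényi and Tsallis definitions below are assumptions of this illustration, not notation taken from the paper.

```python
import math

def frequency_moment(counts, p):
    """Exact p-th frequency moment F_(p) = sum_i f_i^p (no sketching involved)."""
    return sum(f ** p for f in counts if f > 0)

def shannon_entropy(counts):
    """H = -sum_i (f_i / F_(1)) * log(f_i / F_(1))."""
    total = sum(counts)
    return -sum((f / total) * math.log(f / total) for f in counts if f > 0)

def renyi_entropy(counts, p):
    """H_p = log(F_(p) / F_(1)^p) / (1 - p), defined for p != 1."""
    ratio = frequency_moment(counts, p) / sum(counts) ** p
    return math.log(ratio) / (1.0 - p)

def tsallis_entropy(counts, p):
    """T_p = (1 - F_(p) / F_(1)^p) / (p - 1), defined for p != 1."""
    ratio = frequency_moment(counts, p) / sum(counts) ** p
    return (1.0 - ratio) / (p - 1.0)

# Illustrative item counts (hypothetical data, not from the paper).
counts = [5, 3, 2]
print("Shannon:", shannon_entropy(counts))
for p in (1.5, 1.1, 1.01, 1.001):
    print(p, renyi_entropy(counts, p), tsallis_entropy(counts, p))
```

Both approximations move toward the Shannon value as p approaches 1. In a streaming setting, the exact moment would be replaced by an estimate, which is where the estimator's variance and the bias from choosing p away from 1 have to be traded off.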
Sequential Quantiles via Hermite Series Density Estimation
Sequential quantile estimation refers to incorporating observations into
quantile estimates in an incremental fashion, thus furnishing an online estimate
of one or more quantiles at any given point in time. Sequential quantile
estimation is also known as online quantile estimation. This area is relevant
to the analysis of data streams and to the one-pass analysis of massive data
sets. Applications include network traffic and latency analysis, real-time
fraud detection, and high-frequency trading. We introduce new techniques for
online quantile estimation based on Hermite series estimators in the settings
of static quantile estimation and dynamic quantile estimation. In the static
quantile estimation setting we apply the existing Gauss-Hermite expansion in a
novel manner. In particular, we exploit the fact that Gauss-Hermite
coefficients can be updated in a sequential manner. To treat dynamic quantile
estimation we introduce a novel expansion with an exponentially weighted
estimator for the Gauss-Hermite coefficients which we term the Exponentially
Weighted Gauss-Hermite (EWGH) expansion. These algorithms go beyond existing
sequential quantile estimation algorithms in that they allow arbitrary
quantiles (as opposed to pre-specified quantiles) to be estimated at any point
in time. In doing so we provide a solution to online distribution function and
online quantile function estimation on data streams. In particular we derive an
analytical expression for the CDF and prove consistency results for the CDF
under certain conditions. In addition we analyse the associated quantile
estimator. Simulation studies and tests on real data reveal the Gauss-Hermite
based algorithms to be competitive with a leading existing algorithm.
Comment: 43 pages, 9 figures. Improved version incorporating referee comments,
as appears in Electronic Journal of Statistics
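The idea of updating Hermite series coefficients one observation at a time can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, the data are assumed to be roughly centered with unit scale, and the CDF is obtained by numerical integration of the clipped density estimate rather than by the analytical expression derived in the paper.

```python
import math
import random

def hermite_functions(x, order):
    """Normalized Hermite functions h_0..h_order at x, via the three-term recurrence."""
    h = [math.pi ** -0.25 * math.exp(-0.5 * x * x)]
    if order >= 1:
        h.append(math.sqrt(2.0) * x * h[0])
    for k in range(1, order):
        h.append(math.sqrt(2.0 / (k + 1)) * x * h[k]
                 - math.sqrt(k / (k + 1)) * h[k - 1])
    return h

class SequentialHermiteEstimator:
    """Hypothetical helper: keeps one running coefficient per Hermite function."""

    def __init__(self, order=6, weight=None):
        self.order = order
        self.weight = weight  # None: running mean; otherwise exponential weight in (0, 1]
        self.n = 0
        self.coeffs = [0.0] * (order + 1)

    def update(self, x):
        """Fold one observation into the coefficients in O(order) time."""
        self.n += 1
        if self.weight is None or self.n == 1:
            lam = 1.0 / self.n      # running mean (also initializes the weighted case)
        else:
            lam = self.weight       # exponentially weighted update
        for k, hk in enumerate(hermite_functions(x, self.order)):
            self.coeffs[k] += lam * (hk - self.coeffs[k])

    def density(self, x):
        """Truncated series estimate sum_k a_k h_k(x); may dip below zero."""
        return sum(a * hk for a, hk in zip(self.coeffs, hermite_functions(x, self.order)))

    def quantile(self, p, lo=-6.0, hi=6.0, steps=1200):
        """Invert a numerically integrated CDF (simplification of this sketch)."""
        dx = (hi - lo) / steps
        xs = [lo + i * dx for i in range(steps + 1)]
        dens = [max(self.density(x), 0.0) for x in xs]  # clip negative lobes
        cum = [0.0]
        for i in range(steps):
            cum.append(cum[-1] + 0.5 * (dens[i] + dens[i + 1]) * dx)
        target = p * cum[-1]  # renormalize so the estimated CDF ends at 1
        for x, c in zip(xs, cum):
            if c >= target:
                return x
        return hi

# Synthetic standard normal stream (illustrative data only).
random.seed(0)
stream = [random.gauss(0.0, 1.0) for _ in range(2000)]
estimator = SequentialHermiteEstimator(order=6)
for value in stream:
    estimator.update(value)
print("estimated median:", estimator.quantile(0.5))
```

Because only the coefficient vector is stored, any quantile can be requested at any point in the stream without revisiting earlier observations. Passing a `weight` switches the running mean to an exponentially weighted update, in the spirit of the dynamic setting described above.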
Simulation of the spatio-temporal extent of groundwater flooding using statistical methods of hydrograph classification and lumped parameter models
This article presents the development of a relatively low-cost and rapidly applicable methodology to simulate the spatio-temporal occurrence of groundwater flooding in chalk catchments. In winter 2000/2001 extreme rainfall resulted in anomalously high groundwater levels and groundwater flooding in many chalk catchments of northern Europe and the southern United Kingdom. Groundwater flooding was extensive and prolonged, occurring in areas where it had not been recently observed and, in places, lasting for 6 months. In many of these catchments, the prediction of groundwater flooding is hindered by the lack of an appropriate tool, such as a distributed groundwater model, or the inability of models to simulate extremes adequately. A set of groundwater hydrographs is simulated using a simple lumped parameter groundwater model. The number of models required is minimized through the classification and grouping of groundwater level time-series using principal component analysis and cluster analysis. One representative hydrograph is modelled, then transposed to other observed hydrographs in the same group by the process of quantile mapping. Time-variant groundwater level surfaces, generated using the discrete set of modelled hydrographs and river elevation data, are overlain on a digital terrain model to predict the spatial extent of groundwater flooding. The methodology is applied to the Pang and Lambourn catchments in southern England, for which monthly groundwater level time-series exist for 52 observation boreholes covering the period 1975–2004. The results are validated against observed groundwater flood extent data obtained from aerial surveys and field mapping. The method is shown to simulate the spatial and temporal occurrence of flooding during the 2000/2001 flood event accurately.
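The quantile mapping step can be illustrated with a minimal sketch based on empirical distributions. The function names, the linear interpolation between order statistics, and the borehole levels below are all assumptions made for illustration; the article does not specify how its mapping is implemented.

```python
from bisect import bisect_left

def empirical_quantile(sorted_vals, p):
    """Value at non-exceedance probability p, interpolating between order statistics."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def empirical_cdf(sorted_vals, v):
    """Non-exceedance probability of v, clamped to [0, 1] outside the observed range."""
    if v <= sorted_vals[0]:
        return 0.0
    if v >= sorted_vals[-1]:
        return 1.0
    hi = bisect_left(sorted_vals, v)  # first index whose value is >= v
    lo = hi - 1
    frac = (v - sorted_vals[lo]) / (sorted_vals[hi] - sorted_vals[lo])
    return (lo + frac) / (len(sorted_vals) - 1)

def quantile_map(value, source_obs, target_obs):
    """Transpose a level from the representative borehole to a target borehole."""
    p = empirical_cdf(sorted(source_obs), value)
    return empirical_quantile(sorted(target_obs), p)

# Hypothetical observed groundwater levels (metres) at two boreholes in one cluster.
representative = [90.0, 92.0, 95.0, 100.0, 104.0]
target = [190.0, 194.0, 200.0, 210.0, 218.0]

# A modelled level at the representative borehole, mapped to the target borehole.
print(quantile_map(97.0, representative, target))
```

A modelled level is first converted to its non-exceedance probability in the representative borehole's observed record, then read off at the same probability in the target borehole's record. Values outside the observed range are clamped here, which would understate extremes; a real application would need an explicit choice about extrapolation.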
UDDSketch: Accurate Tracking of Quantiles in Data Streams
I. Epicoco, C. Melle, M. Cafaro, M. Pulimeno, G. Morleo
- …