Horvitz-Thompson estimators for functional data: asymptotic confidence bands and optimal allocation for stratified sampling
When dealing with very large datasets of functional data, survey sampling
approaches are useful in order to obtain estimators of simple functional
quantities, without being obliged to store all the data. We propose here a
Horvitz--Thompson estimator of the mean trajectory. In the context of a
superpopulation framework, we prove under mild regularity conditions that we
obtain uniformly consistent estimators of the mean function and of its variance
function. With additional assumptions on the sampling design we state a
functional Central Limit Theorem and deduce asymptotic confidence bands.
Stratified sampling is studied in detail, and we also obtain a functional
version of the usual optimal allocation rule considering a mean variance
criterion. These techniques are illustrated by means of a test population of
N=18902 electricity meters for which we have individual electricity consumption
measures every 30 minutes over one week. We show that stratification can
substantially improve the accuracy of the estimators and reduce the width
of the global confidence bands compared to simple random sampling without
replacement. Comment: Accepted for publication in Biometrika
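The Horvitz-Thompson estimator of the mean trajectory weights each sampled curve by the inverse of its inclusion probability. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `horvitz_thompson_mean`, the toy strata sizes, and the constant curves are all hypothetical, and the design is stratified simple random sampling without replacement, so that a unit in stratum h has inclusion probability n_h / N_h.

```python
import numpy as np

def horvitz_thompson_mean(sampled_curves, inclusion_probs, population_size):
    """HT estimate of the mean curve: (1/N) * sum over the sample of Y_k(t) / pi_k."""
    curves = np.asarray(sampled_curves, dtype=float)   # shape (n, T): one row per sampled unit
    pi = np.asarray(inclusion_probs, dtype=float)      # shape (n,): first-order inclusion probabilities
    return (curves / pi[:, None]).sum(axis=0) / population_size

# Toy stratified design (hypothetical numbers): stratum sizes N_h, sample sizes n_h.
N_h = {"A": 600, "B": 400}
n_h = {"A": 30, "B": 60}
levels = {"A": 1.0, "B": 3.0}   # every curve in a stratum is constant at this level
T = 4                           # number of discretisation points

curves = np.vstack([np.full((n_h[h], T), levels[h]) for h in N_h])
pi = np.concatenate([np.full(n_h[h], n_h[h] / N_h[h]) for h in N_h])

mu_hat = horvitz_thompson_mean(curves, pi, sum(N_h.values()))
# With constant curves per stratum the estimate recovers the population mean
# (600 * 1.0 + 400 * 3.0) / 1000, up to floating-point error.
```

Because the weights are 1 / pi_k, oversampling stratum B in this toy design does not bias the estimate; it only changes how much each sampled curve counts.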
Fast Estimation of the Median Covariation Matrix with Application to Online Robust Principal Components Analysis
The geometric median covariation matrix is a robust multivariate indicator of
dispersion which can be extended without any difficulty to functional data. We
define estimators, based on recursive algorithms, that can be simply updated at
each new observation and are able to deal rapidly with large samples of high
dimensional data without being obliged to store all the data in memory.
Asymptotic convergence properties of the recursive algorithms are studied under
weak conditions. The computation of the principal components can also be
performed online and this approach can be useful for online outlier detection.
A simulation study clearly shows that this robust indicator is a competitive
alternative to the minimum covariance determinant when the dimension of the data
is small, and to robust principal components analysis based on projection pursuit
and spherical projections for high dimensional data. An illustration on a large
sample and high dimensional dataset consisting of individual TV audiences
measured at a minute scale over a period of 24 hours confirms the interest of
considering the robust principal components analysis based on the median
covariation matrix. All studied algorithms are available in the R package
Gmedian on CRAN.
Confidence bands for Horvitz-Thompson estimators using sampled noisy functional data
When collections of functional data are too large to be exhaustively
observed, survey sampling techniques provide an effective way to estimate
global quantities such as the population mean function. Assuming functional
data are collected from a finite population according to a probabilistic
sampling scheme, with the measurements being discrete in time and noisy, we
propose to first smooth the sampled trajectories with local polynomials and
then estimate the mean function with a Horvitz-Thompson estimator. Under mild
conditions on the population size, observation times, regularity of the
trajectories, sampling scheme, and smoothing bandwidth, we prove a Central
Limit theorem in the space of continuous functions. We also establish the
uniform consistency of a covariance function estimator and apply the former
results to build confidence bands for the mean function. The bands attain
nominal coverage and are obtained through Gaussian process simulations
conditional on the estimated covariance function. To select the bandwidth, we
propose a cross-validation method that accounts for the sampling weights. A
simulation study assesses the performance of our approach and highlights the
influence of the sampling scheme and bandwidth choice. Comment: Published at http://dx.doi.org/10.3150/12-BEJ443 in Bernoulli
(http://isi.cbs.nl/bernoulli/) by the International Statistical
Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)
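Building a simultaneous band from Gaussian process simulations can be illustrated as follows. This is a sketch of one common construction, not necessarily the exact procedure of the paper: it assumes a mean estimate and a covariance estimate already evaluated on a time grid, scales the band by the pointwise standard deviation, and takes the band constant as a simulated quantile of the supremum of the standardised process. The function name `simultaneous_band` and its parameters are hypothetical.

```python
import numpy as np

def simultaneous_band(mu_hat, cov_hat, level=0.95, n_sim=5000, seed=0):
    """Return (lower, upper, c) with band mu_hat(t) +/- c * sd(t) on the grid."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    cov_hat = np.asarray(cov_hat, dtype=float)
    sd = np.sqrt(np.diag(cov_hat))                 # pointwise standard deviations (assumed > 0)
    corr = cov_hat / np.outer(sd, sd)              # correlation matrix on the grid
    # Symmetric square root via eigendecomposition; tiny negative eigenvalues are clipped.
    vals, vecs = np.linalg.eigh(corr)
    root = vecs * np.sqrt(np.clip(vals, 0.0, None))
    rng = np.random.default_rng(seed)
    # Each row of Z is one simulated centred Gaussian path with correlation `corr`.
    Z = rng.standard_normal((n_sim, mu_hat.size)) @ root.T
    c = np.quantile(np.abs(Z).max(axis=1), level)  # quantile of sup_t |Z(t)|
    return mu_hat - c * sd, mu_hat + c * sd, c
```

The constant `c` is never smaller, in expectation, than the pointwise Gaussian quantile, which is why a simultaneous band is wider than a collection of pointwise intervals.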
Online estimation of the geometric median in Hilbert spaces: non asymptotic confidence balls
Estimation procedures based on recursive algorithms are interesting and
powerful techniques that are able to deal rapidly with (very) large samples of
high dimensional data. The collected data may be contaminated by noise so that
robust location indicators, such as the geometric median, may be preferred to
the mean. In this context, an estimator of the geometric median based on a fast
and efficient averaged non linear stochastic gradient algorithm has been
developed by Cardot, Cénac and Zitt (2013). This work aims at studying more
precisely the non asymptotic behavior of this algorithm by giving non
asymptotic confidence balls. This new result is based on the derivation of
improved rates of convergence as well as an exponential inequality for
the martingale terms of the recursive non linear Robbins-Monro algorithm.
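An averaged stochastic gradient recursion of the kind described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' code: the step sizes are taken as `c_gamma / n**alpha` with `alpha` between 1/2 and 1, the first observation is used as the starting point, and the name `averaged_sg_median` and the contaminated toy sample are hypothetical.

```python
import numpy as np

def averaged_sg_median(data, c_gamma=1.0, alpha=0.75):
    """One pass over the data; returns the average of the Robbins-Monro iterates."""
    X = np.asarray(data, dtype=float)
    z = X[0].copy()        # current stochastic gradient iterate
    z_bar = X[0].copy()    # running average of the iterates
    for n, x in enumerate(X[1:], start=1):
        diff = x - z
        norm = np.linalg.norm(diff)
        if norm > 0:       # the gradient direction is undefined when x == z
            z = z + (c_gamma / n**alpha) * diff / norm
        z_bar = z_bar + (z - z_bar) / (n + 1)
    return z_bar

# Toy usage: Gaussian sample with 5% gross outliers (hypothetical contamination).
rng = np.random.default_rng(0)
center = np.array([1.0, -2.0, 0.5])
sample = rng.normal(loc=center, scale=1.0, size=(20000, 3))
sample[10::20] = 100.0
median_hat = averaged_sg_median(sample)
```

Each update moves the iterate by a fixed-length step in the direction of the new observation, so a single distant outlier cannot drag the estimate far, unlike the running mean.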
Homeworks: Stable Home + Stable School = Bright Futures
Chicago Coalition for the Homeless surveyed 118 homeless families with school-aged children and found that the experiences of Chicago's homeless students closely mirrored what the national research showed. Surveys were conducted at public schools, shelters, and parks during the summer of 2015. More than 80% of the families interviewed have between 1 and 3 school-aged children and fewer than 20% have more than three children attending school.
A fast and recursive algorithm for clustering large datasets with k-medians
Clustering large samples of high dimensional data with fast algorithms is an
important challenge in computational statistics. Borrowing ideas from MacQueen
(1967) who introduced a sequential version of the k-means algorithm, a new
class of recursive stochastic gradient algorithms designed for the k-medians
loss criterion is proposed. By their recursive nature, these algorithms are
very fast and are well adapted to deal with large samples of data that are
allowed to arrive sequentially. It is proved that the stochastic gradient
algorithm converges almost surely to the set of stationary points of the
underlying loss criterion. Particular attention is paid to the averaged
versions, which are known to have better performance, and a data-driven
procedure that allows automatic selection of the value of the descent step is
proposed.
The performance of the averaged sequential estimator is compared in a
simulation study, both in terms of computation speed and accuracy of the
estimations, with more classical partitioning techniques such as k-means,
trimmed k-means and PAM (partitioning around medoids). Finally, this new
online clustering technique is illustrated on determining television audience
profiles with a sample of more than 5000 individual television audiences
measured every minute over a period of 24 hours. Comment: Under revision for Computational Statistics and Data Analysis
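A MacQueen-style recursive update for the k-medians criterion can be sketched as follows. This is an illustration under assumptions rather than the paper's algorithm: the step for a cluster is taken as `c_gamma / count**alpha`, where `count` is the number of points assigned to that cluster so far; initial centers are supplied by the caller; the data-driven step selection mentioned in the abstract is not reproduced. The name `recursive_k_medians` and the toy data are hypothetical.

```python
import numpy as np

def recursive_k_medians(data, init_centers, c_gamma=1.0, alpha=0.75):
    """Single pass over the data; returns the averaged centers."""
    X = np.asarray(data, dtype=float)
    centers = np.array(init_centers, dtype=float)    # stochastic gradient iterates (copied)
    averaged = centers.copy()                        # per-cluster running averages
    counts = np.zeros(len(centers), dtype=int)       # points assigned to each cluster
    for x in X:
        dists = np.linalg.norm(centers - x, axis=1)
        j = int(np.argmin(dists))                    # nearest center takes the update
        counts[j] += 1
        if dists[j] > 0:
            step = c_gamma / counts[j] ** alpha
            centers[j] += step * (x - centers[j]) / dists[j]
        averaged[j] += (centers[j] - averaged[j]) / (counts[j] + 1)
    return averaged

# Toy usage: two well-separated Gaussian clusters, rows shuffled in place.
rng = np.random.default_rng(1)
sample = np.vstack([rng.normal(0.0, 1.0, size=(2000, 2)),
                    rng.normal(10.0, 1.0, size=(2000, 2))])
rng.shuffle(sample)
centers_hat = recursive_k_medians(sample, init_centers=[[1.0, 1.0], [9.0, 9.0]])
```

Only the nearest center is touched for each new observation, so the cost per observation is linear in the number of clusters and the data never need to be held in memory all at once.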