Contour projected dimension reduction
In regression analysis, we employ contour projection (CP) to develop a new
dimension reduction theory. Accordingly, we introduce the notions of the
central contour subspace and generalized contour subspace. We show that both of
their structural dimensions are no larger than that of the central subspace
[Cook, Regression Graphics (1998b), Wiley]. Furthermore, we employ CP-sliced
inverse regression, CP-sliced average variance estimation and CP-directional
regression to estimate the generalized contour subspace, and we subsequently
obtain their theoretical properties. Monte Carlo studies demonstrate that the
three CP-based dimension reduction methods outperform their corresponding
non-CP approaches when the predictors have heavy-tailed elliptical
distributions. An empirical example is also presented to illustrate the
usefulness of the CP method.
Comment: Published at http://dx.doi.org/10.1214/08-AOS679 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
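For readers unfamiliar with the non-CP baseline, the following Python sketch implements plain sliced inverse regression (SIR); CP-SIR applies a contour-projection transform to the predictors before this step, which is not reproduced here. The function name, slice count, and other details are illustrative assumptions, not the paper's code.

    import numpy as np

    def sir_directions(X, y, n_slices=10, n_dir=2):
        """Estimate dimension-reduction directions by plain sliced inverse regression."""
        n, p = X.shape
        # Standardize predictors: Z = (X - mean) Sigma^{-1/2}.
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)
        evals, evecs = np.linalg.eigh(Sigma)
        Sigma_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
        Z = (X - mu) @ Sigma_inv_half
        # Slice the response and average the standardized predictors within each slice.
        order = np.argsort(y)
        slices = np.array_split(order, n_slices)
        M = np.zeros((p, p))
        for idx in slices:
            m = Z[idx].mean(axis=0)
            M += (len(idx) / n) * np.outer(m, m)
        # Leading eigenvectors of M, mapped back to the original predictor scale.
        w, v = np.linalg.eigh(M)
        return Sigma_inv_half @ v[:, ::-1][:, :n_dir]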
Statistical Analysis of Fixed Mini-Batch Gradient Descent Estimator
We study here a fixed mini-batch gradient descent (FMGD) algorithm to solve
optimization problems with massive datasets. In FMGD, the whole sample is split
into multiple non-overlapping partitions. Once the partitions are formed, they
are then fixed throughout the rest of the algorithm. For convenience, we refer
to the fixed partitions as fixed mini-batches. Then for each computation
iteration, the gradients are sequentially calculated on each fixed mini-batch.
Because each fixed mini-batch is typically much smaller than the whole sample,
its gradient can be computed cheaply. This greatly reduces the computation cost
of each iteration and makes FMGD computationally efficient and practically
feasible. To demonstrate the theoretical properties of
FMGD, we start with a linear regression model with a constant learning rate. We
study its numerical convergence and statistical efficiency properties. We find
that sufficiently small learning rates are required for both
numerical convergence and statistical efficiency. Nevertheless, an extremely
small learning rate might lead to painfully slow numerical convergence. To
solve the problem, a diminishing learning rate scheduling strategy can be used.
This leads to the FMGD estimator with faster numerical convergence and better
statistical efficiency. Finally, the FMGD algorithms with random shuffling and
a general loss function are also studied.
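As a rough illustration of the fixed mini-batch scheme described above, the following Python sketch runs FMGD on a least-squares linear regression with a constant or diminishing learning rate; the partitioning, learning-rate values, and function names are illustrative assumptions rather than the paper's implementation.

    import numpy as np

    def fmgd_linear(X, y, n_batches=10, lr=0.01, n_epochs=100, diminishing=False):
        """Fixed mini-batch gradient descent for least-squares linear regression."""
        n, p = X.shape
        rng = np.random.default_rng(0)
        # Form the fixed, non-overlapping mini-batches once.
        perm = rng.permutation(n)
        batches = np.array_split(perm, n_batches)
        beta = np.zeros(p)
        for epoch in range(n_epochs):
            # Optionally shrink the learning rate across epochs.
            rate = lr / (1 + epoch) if diminishing else lr
            for idx in batches:                          # same partitions every epoch
                Xb, yb = X[idx], y[idx]
                grad = Xb.T @ (Xb @ beta - yb) / len(idx)  # squared-loss gradient
                beta -= rate * grad
        return beta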
Distributed Logistic Regression for Massive Data with Rare Events
Large-scale rare events data are commonly encountered in practice. To tackle
such massive rare events data, we propose a novel estimation method for
logistic regression in a distributed system. In a distributed framework, we
face two challenges. The first is how to distribute
the data. In this regard, two different distribution strategies (i.e., the
RANDOM strategy and the COPY strategy) are investigated. The second challenge
is how to select an appropriate type of objective function so that the best
asymptotic efficiency can be achieved. To this end, the under-sampled (US) and
inverse probability weighted (IPW) objective functions are considered. Our
results suggest that the COPY strategy together with the IPW objective function
is the best solution for distributed logistic regression with rare events. The
finite sample performance of the distributed methods is demonstrated by
simulation studies and a real-world Swedish Traffic Sign dataset.
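The sketch below illustrates one plausible reading of the COPY strategy combined with an IPW objective: every worker keeps all rare positive cases plus a disjoint share of the negatives, and the negatives are re-weighted by the inverse of their sampling probability. The function names, weighting, and aggregation-by-averaging step are assumptions for illustration, not the paper's exact estimator.

    import numpy as np
    from scipy.optimize import minimize

    def ipw_logistic(X, y, w):
        """Weighted logistic regression fitted by numerical optimization."""
        def nll(beta):
            eta = X @ beta
            return np.sum(w * (np.logaddexp(0.0, eta) - y * eta))
        return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

    def copy_ipw_estimator(X, y, n_workers=5, seed=0):
        rng = np.random.default_rng(seed)
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        neg_parts = np.array_split(rng.permutation(neg), n_workers)
        betas = []
        for part in neg_parts:
            idx = np.concatenate([pos, part])          # COPY: all positives on every worker
            w = np.where(y[idx] == 1, 1.0, n_workers)  # IPW: negatives sampled w.p. 1/n_workers
            betas.append(ipw_logistic(X[idx], y[idx], w))
        return np.mean(betas, axis=0)                  # combine worker estimates by averaging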
Optimal Subsampling Bootstrap for Massive Data
The bootstrap is a widely used procedure for statistical inference because of
its simplicity and attractive statistical properties. However, the vanilla
version of bootstrap is no longer feasible computationally for many modern
massive datasets due to the need to repeatedly resample the entire data.
Therefore, several improvements to the bootstrap method have been made in
recent years, which assess the quality of estimators by subsampling the full
dataset before resampling the subsamples. Naturally, the performance of these
modern subsampling methods is influenced by tuning parameters such as the size
of subsamples, the number of subsamples, and the number of resamples per
subsample. In this paper, we develop a novel methodology for selecting these
tuning parameters. We formulate the selection task as an optimization problem:
optimize a measure of estimator accuracy subject to a computational cost
constraint. Our framework then provides closed-form solutions for the optimal
hyperparameter values of the subsampled bootstrap, the subsampled double
bootstrap, and the bag of little bootstraps, at little or no extra time cost.
Using the mean squared error as a proxy for the accuracy measure, we
apply our methodology to study, compare, and improve the performance of these
modern versions of the bootstrap developed for massive data through simulation
studies. The results are promising.
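To make the three tuning parameters concrete, the sketch below implements a bag of little bootstraps for a sample mean, parameterized by the subsample size b, the number of subsamples s, and the number of resamples r per subsample; the statistic and default seed are illustrative only.

    import numpy as np

    def blb_std_error(x, b, s, r, seed=0):
        """Bag of little bootstraps estimate of the standard error of the mean."""
        rng = np.random.default_rng(seed)
        n = len(x)
        subsample_ses = []
        for _ in range(s):
            sub = rng.choice(x, size=b, replace=False)       # one little subsample
            stats = []
            for _ in range(r):
                # Resample n points from the subsample via multinomial weights.
                counts = rng.multinomial(n, np.full(b, 1 / b))
                stats.append(np.sum(counts * sub) / n)       # weighted sample mean
            subsample_ses.append(np.std(stats, ddof=1))
        return np.mean(subsample_ses)                        # average over subsamples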
Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis with Limited Computational Resources
Modern statistical analysis often encounters datasets with large sizes. For
these datasets, conventional estimation methods can hardly be applied directly,
because practitioners often have limited computational resources and, in most
cases, no access to powerful computing platforms (e.g., Hadoop or Spark). How
to practically analyze large datasets with limited computational
resources then becomes a problem of great importance. To solve this problem, we
propose here a novel subsampling-based method with jackknifing. The key idea is
to treat the whole sample data as if they were the population. Then, multiple
subsamples with greatly reduced sizes are obtained by the method of simple
random sampling with replacement. Notably, we do not recommend sampling
without replacement, because it would incur a significant cost for data
processing on the hard drive; such cost does not arise if the data are
processed in memory. Because subsampled data have relatively small
sizes, they can be comfortably read into computer memory as a whole and then
processed easily. Based on subsampled datasets, jackknife-debiased estimators
can be obtained for the target parameter. The resulting estimators are
statistically consistent, with an extremely small bias. Finally, the
jackknife-debiased estimators from different subsamples are averaged together
to form the final estimator. We theoretically show that the final estimator is
consistent and asymptotically normal. Its asymptotic statistical efficiency can
be as good as that of the whole sample estimator under very mild conditions.
The proposed method is simple enough to be easily implemented on most practical
computer systems and thus should have very wide applicability.
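The following sketch illustrates the subsample-then-jackknife recipe described above: draw small subsamples with replacement, apply a delete-one jackknife bias correction on each subsample, and average the debiased estimates. The example statistic (the squared mean) and the sizes are illustrative assumptions.

    import numpy as np

    def jackknife_debias(data, estimator):
        """Delete-one jackknife bias correction of a generic estimator."""
        m = len(data)
        theta = estimator(data)
        loo = np.array([estimator(np.delete(data, i)) for i in range(m)])
        return m * theta - (m - 1) * loo.mean()

    def subsample_jackknife(data, estimator, n_subsamples=20, subsample_size=500, seed=0):
        rng = np.random.default_rng(seed)
        estimates = []
        for _ in range(n_subsamples):
            sub = rng.choice(data, size=subsample_size, replace=True)  # with replacement
            estimates.append(jackknife_debias(sub, estimator))
        return np.mean(estimates)          # final estimator: average of debiased estimates

    # Example: debiased estimate of (population mean)^2 from a large 1-D array x.
    # theta_hat = subsample_jackknife(x, lambda d: d.mean() ** 2)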
On the asymptotic properties of a bagging estimator with a massive dataset
Bagging is a useful method for large-scale statistical analysis, especially
when the computing resources are very limited. We study here the asymptotic
properties of bagging estimators for M-estimation problems with massive
datasets. We theoretically prove that the resulting estimator is consistent and
asymptotically normal under appropriate conditions. The results show that the
bagging estimator can achieve the optimal statistical efficiency, provided that
the bagging subsample size and the number of subsamples are sufficiently large.
Moreover, we derive a variance estimator for valid asymptotic inference. All
theoretical findings are further verified by extensive simulation studies.
Finally, we apply the bagging method to the US Airline Dataset to demonstrate
its practical usefulness.
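As a rough illustration of bagging for M-estimation, the sketch below fits a Huber-loss regression (one convenient M-estimator, chosen here purely for illustration) on each subsample and averages the fits; the spread-based variance estimate shown is a simple stand-in and not necessarily the variance estimator derived in the paper.

    import numpy as np
    from scipy.optimize import minimize

    def huber_fit(X, y, delta=1.345):
        """Huber-loss regression, a simple M-estimator."""
        def loss(beta):
            r = y - X @ beta
            return np.sum(np.where(np.abs(r) <= delta,
                                   0.5 * r ** 2,
                                   delta * (np.abs(r) - 0.5 * delta)))
        return minimize(loss, np.zeros(X.shape[1]), method="BFGS").x

    def bagging_m_estimator(X, y, n_subsamples=20, subsample_size=1000, seed=0):
        rng = np.random.default_rng(seed)
        fits = []
        for _ in range(n_subsamples):
            idx = rng.choice(len(y), size=subsample_size, replace=True)
            fits.append(huber_fit(X[idx], y[idx]))
        fits = np.asarray(fits)
        beta_bag = fits.mean(axis=0)                           # bagged estimator
        cov_bag = np.cov(fits, rowvar=False) / n_subsamples    # crude spread-based variance
        return beta_bag, cov_bag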
Subnetwork Estimation for Spatial Autoregressive Models in Large-scale Networks
Large-scale networks (e.g., Facebook and Twitter) are commonly encountered by
researchers in practice. To study the network interaction between
different nodes of large-scale networks, the spatial autoregressive (SAR) model
has been popularly employed. Despite its popularity, the estimation of a SAR
model on large-scale networks remains very challenging. On the one hand, due to
policy limitations or high collection costs, it is often impossible for
independent researchers to observe or collect all network information. On the
other hand, even if the entire network is accessible, estimating the SAR model
using the quasi-maximum likelihood estimator (QMLE) could be computationally
infeasible due to its high computational cost. To address these challenges, we
propose here a subnetwork estimation method based on QMLE for the SAR model. By
using appropriate sampling methods, a subnetwork, consisting of a much-reduced
number of nodes, can be constructed. Subsequently, the standard QMLE can be
computed by treating the sampled subnetwork as if it were the entire network.
This leads to a significant reduction in information collection and model
computation costs, which makes the approach much more practically feasible.
Theoretically, we show that the subnetwork-based QMLE is consistent and
asymptotically normal under appropriate regularity conditions. Extensive
simulation studies, based on both simulated and real network structures, are
presented.
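The sketch below illustrates the subnetwork idea for the SAR model Y = rho*W*Y + X*beta + eps: sample a set of nodes, keep the induced subnetwork, row-normalize its adjacency matrix, and maximize the standard Gaussian quasi-likelihood as if the subnetwork were the whole network. The uniform node sampling and the grid search over rho are illustrative choices, not the paper's exact scheme.

    import numpy as np

    def sar_qmle(Y, X, W, rho_grid=np.linspace(-0.9, 0.9, 181)):
        """Concentrated Gaussian QMLE for the SAR model via a grid search over rho."""
        n = len(Y)
        best = (-np.inf, None, None)
        for rho in rho_grid:
            S = np.eye(n) - rho * W
            SY = S @ Y
            beta = np.linalg.lstsq(X, SY, rcond=None)[0]
            resid = SY - X @ beta
            sigma2 = resid @ resid / n
            # Concentrated log-likelihood (additive constants dropped).
            ll = np.linalg.slogdet(S)[1] - 0.5 * n * np.log(sigma2)
            if ll > best[0]:
                best = (ll, rho, beta)
        return best[1], best[2]

    def subnetwork_qmle(Y, X, A, n_sub=500, seed=0):
        rng = np.random.default_rng(seed)
        nodes = rng.choice(len(Y), size=n_sub, replace=False)             # sampled subnetwork
        A_sub = A[np.ix_(nodes, nodes)].astype(float)
        W_sub = A_sub / np.maximum(A_sub.sum(axis=1, keepdims=True), 1)   # row-normalize
        return sar_qmle(Y[nodes], X[nodes], W_sub)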
Estimating Mixture of Gaussian Processes by Kernel Smoothing
When functional data are not homogeneous, for example, when there are multiple classes of functional curves in the dataset, traditional estimation methods may fail. In this article, we propose a new estimation procedure for the mixture of Gaussian processes, to incorporate both functional and inhomogeneous properties of the data. Our method can be viewed as a natural extension of high-dimensional normal mixtures. However, the key difference is that smoothed structures are imposed for both the mean and covariance functions. The model is shown to be identifiable, and can be estimated efficiently by a combination of ideas from the expectation-maximization (EM) algorithm, kernel regression, and functional principal component analysis. Our methodology is empirically justified by Monte Carlo simulations and illustrated by an analysis of a supermarket dataset.
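The sketch below is a much-simplified illustration of the EM-plus-kernel-smoothing idea: curves observed on a common grid, component mean curves re-smoothed by a Nadaraya-Watson smoother after every M-step, and the covariance collapsed to a single noise variance per component. The paper's full estimator additionally smooths the covariance functions and uses functional principal components; everything below is an illustrative simplification.

    import numpy as np

    def nw_smooth(t, values, h=0.1):
        """Nadaraya-Watson kernel smoother of `values` observed at times `t`."""
        K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)  # Gaussian kernel
        return (K @ values) / K.sum(axis=1)

    def em_mixture_gp(curves, t, n_components=2, n_iter=50, h=0.1, seed=0):
        """Simplified EM for a mixture of smooth mean curves with i.i.d. noise."""
        n, T = curves.shape
        rng = np.random.default_rng(seed)
        means = curves[rng.choice(n, n_components, replace=False)].copy()
        sigma2 = np.full(n_components, curves.var())
        pi = np.full(n_components, 1 / n_components)
        for _ in range(n_iter):
            # E-step: posterior component probabilities for each curve.
            logp = np.stack([
                -0.5 * np.sum((curves - means[k]) ** 2, axis=1) / sigma2[k]
                - 0.5 * T * np.log(sigma2[k]) + np.log(pi[k])
                for k in range(n_components)], axis=1)
            logp -= logp.max(axis=1, keepdims=True)
            resp = np.exp(logp)
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: weighted pointwise means, then kernel-smoothed over time.
            for k in range(n_components):
                w = resp[:, k]
                raw_mean = (w[:, None] * curves).sum(axis=0) / w.sum()
                means[k] = nw_smooth(t, raw_mean, h=h)
                sigma2[k] = (w[:, None] * (curves - means[k]) ** 2).sum() / (w.sum() * T)
            pi = resp.mean(axis=0)
        return means, sigma2, pi, resp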