Optimal Tuning for Divide-and-conquer Kernel Ridge Regression with Massive Data
Divide-and-conquer is a powerful approach for analyzing large and massive data. In the nonparametric regression setting, although various theoretical frameworks have been established to achieve optimality in estimation or hypothesis testing, how to choose the tuning parameter in a practically effective way remains an open problem. In this paper, we propose a data-driven, divide-and-conquer procedure for selecting the tuning parameters in kernel ridge regression by modifying the popular generalized cross-validation criterion (GCV; Wahba, 1990). The proposed criterion is computationally scalable for massive data sets, and it is shown under mild conditions to be asymptotically optimal in the sense that minimizing the proposed distributed GCV (dGCV) criterion is equivalent to minimizing the true global conditional empirical loss of the averaged function estimator, extending the existing optimality results for GCV to the divide-and-conquer framework.
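As a rough illustration of the idea, the sketch below splits the data into blocks, computes the classical GCV score on each block's kernel ridge fit, and averages the scores over a grid of penalty values. The Gaussian kernel, the fixed block partition, and the plain averaging of block scores are illustrative assumptions; this is not the paper's exact dGCV criterion.

```python
import numpy as np

def krr_fit(X, y, lam, gamma=1.0):
    """Kernel ridge regression with a Gaussian kernel; returns the Gram
    matrix K and the dual coefficients."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)
    n = len(y)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return K, alpha

def gcv_score(y, K, lam):
    """Classical GCV: n^{-1}||(I - A)y||^2 / (1 - tr(A)/n)^2,
    with A the smoother matrix."""
    n = len(y)
    A = K @ np.linalg.inv(K + n * lam * np.eye(n))
    resid = y - A @ y
    return np.mean(resid ** 2) / (1.0 - np.trace(A) / n) ** 2

def distributed_gcv(X, y, lam, n_blocks=4):
    """Average per-block GCV scores over a fixed partition of the data."""
    blocks = np.array_split(np.arange(len(y)), n_blocks)
    scores = []
    for b in blocks:
        K, _ = krr_fit(X[b], y[b], lam)
        scores.append(gcv_score(y[b], K, lam))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
lams = [1e-4, 1e-3, 1e-2, 1e-1]
best = min(lams, key=lambda l: distributed_gcv(X, y, l))
```

Each block only ever inverts a small Gram matrix, which is what makes a criterion of this shape scalable relative to a single global fit.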
Asymptotic optimality and efficient computation of the leave-subject-out cross-validation
Although the leave-subject-out cross-validation (CV) has been widely used in
practice for tuning parameter selection for various nonparametric and
semiparametric models of longitudinal data, its theoretical property is unknown
and solving the associated optimization problem is computationally expensive,
especially when there are multiple tuning parameters. In this paper, by
focusing on the penalized spline method, we show that the leave-subject-out CV
is optimal in the sense that it is asymptotically equivalent to the empirical
squared error loss function minimization. An efficient Newton-type algorithm is
developed to compute the penalty parameters that optimize the CV criterion.
Simulated and real data are used to demonstrate the effectiveness of the
leave-subject-out CV in selecting both the penalty parameters and the working
correlation matrix.
Published in the Annals of Statistics (http://dx.doi.org/10.1214/12-AOS1063) by the Institute of Mathematical Statistics (http://www.imstat.org/aos/).
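The leave-subject-out procedure can be sketched for a penalized spline with a single smoothing parameter: refit with each subject held out and average the subjects' squared prediction errors. The truncated-line basis, the simulated longitudinal data, and the grid search (standing in for the paper's Newton-type algorithm) are illustrative assumptions.

```python
import numpy as np

def spline_basis(t, knots):
    """Truncated-line basis: [1, t, (t - k)_+ for each knot]."""
    cols = [np.ones_like(t), t] + [np.maximum(t - k, 0.0) for k in knots]
    return np.column_stack(cols)

def penalized_fit(B, y, lam, n_poly=2):
    """Ridge penalty on the truncated-basis coefficients only."""
    p = B.shape[1]
    D = np.diag([0.0] * n_poly + [1.0] * (p - n_poly))
    return np.linalg.solve(B.T @ B + lam * D, B.T @ y)

def leave_subject_out_cv(t, y, subject, lam, knots):
    """Hold out one subject at a time; average squared prediction error."""
    errs = []
    for s in np.unique(subject):
        train, test = subject != s, subject == s
        beta = penalized_fit(spline_basis(t[train], knots), y[train], lam)
        pred = spline_basis(t[test], knots) @ beta
        errs.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(0)
subject = np.repeat(np.arange(10), 8)        # 10 subjects, 8 visits each
t = rng.uniform(0, 1, size=80)
b = 0.3 * rng.standard_normal(10)[subject]   # subject-specific random effect
y = np.sin(2 * np.pi * t) + b + 0.2 * rng.standard_normal(80)
knots = np.linspace(0.1, 0.9, 9)
lams = [1e-3, 1e-1, 10.0]
best = min(lams, key=lambda l: leave_subject_out_cv(t, y, subject, l, knots))
```

Holding out whole subjects, rather than individual observations, is what keeps the criterion honest under within-subject correlation.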
Distributed Adaptive Nearest Neighbor Classifier: Algorithm and Theory
When data is of an extraordinarily large size or physically stored in
different locations, the distributed nearest neighbor (NN) classifier is an
attractive tool for classification. We propose a novel distributed adaptive NN
classifier for which the number of nearest neighbors is a tuning parameter
stochastically chosen by a data-driven criterion. An early stopping rule is
proposed when searching for the optimal tuning parameter, which not only speeds
up the computation but also improves the finite sample performance of the
proposed algorithm. The convergence rate of the excess risk of the distributed adaptive
NN classifier is investigated under various sub-sample size compositions. In
particular, we show that when the sub-sample sizes are sufficiently large, the
proposed classifier achieves the nearly optimal convergence rate. Effectiveness
of the proposed approach is demonstrated through simulation studies as well as
an empirical application to a real-world dataset.
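A minimal sketch of the distributed scheme: each sub-sample runs a plain kNN classifier, predictions are combined by majority vote, and k is increased until the validation error stops improving. The majority-vote aggregation, the validation split, and the patience rule are illustrative assumptions, not the paper's exact data-driven criterion.

```python
import numpy as np

def knn_predict(X_tr, y_tr, X_te, k):
    """Plain k-nearest-neighbor majority vote for binary labels."""
    d = np.sum((X_te[:, None, :] - X_tr[None, :, :]) ** 2, axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return (y_tr[nn].mean(axis=1) > 0.5).astype(int)

def distributed_knn(subsamples, X_te, k):
    """Majority vote over per-machine kNN predictions."""
    preds = np.stack([knn_predict(Xs, ys, X_te, k) for Xs, ys in subsamples])
    return (preds.mean(axis=0) > 0.5).astype(int)

def choose_k_early_stop(subsamples, X_val, y_val, k_max=25, patience=3):
    """Increase k while validation error improves; stop after `patience`
    consecutive non-improvements (an early stopping rule)."""
    best_k, best_err, bad = 1, np.inf, 0
    for k in range(1, k_max + 1, 2):       # odd k avoids ties
        err = np.mean(distributed_knn(subsamples, X_val, k) != y_val)
        if err < best_err:
            best_k, best_err, bad = k, err, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_k

rng = np.random.default_rng(0)
def make_data(n):
    X = rng.standard_normal((n, 2))
    return X, (X[:, 0] + X[:, 1] > 0).astype(int)

subsamples = [make_data(100) for _ in range(3)]   # three "machines"
X_val, y_val = make_data(200)
k_star = choose_k_early_stop(subsamples, X_val, y_val)
```

The early stop means most candidate values of k are never evaluated, which is where the computational saving comes from.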
A central limit theorem for a sequence of conditionally centered and α-mixing random fields
A central limit theorem is established for a sum of random variables
belonging to a sequence of random fields. The fields are assumed to have zero
mean conditional on the past history and to satisfy certain α-mixing
conditions in space or/and time. The limiting normal distribution is obtained
for increasing spatial domain, increasing length of the sequence or a
combination of these. The applicability of the theorem is demonstrated by
examples regarding estimating functions for a space-time point process and a
space-time Markov process.
Variable Selection and Function Estimation Using Penalized Methods
Penalized methods are becoming increasingly popular in statistical research. This dissertation covers two major applications of penalized methods: variable selection and nonparametric function estimation. The following two paragraphs give a brief introduction to each topic.
Infinite-variance autoregressive models are important for modeling heavy-tailed time series. We use a penalized method to conduct model selection for autoregressive models with innovations in the domain of attraction of a stable law indexed by α ∈ (0, 2). We show that by combining the least absolute deviation loss function and the adaptive lasso penalty, we can consistently identify the true model. At the same time, the resulting coefficient estimator converges at a rate of n^(−1/α). The proposed approach gives a unified variable selection procedure for both finite- and infinite-variance autoregressive models.
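The variable selection step can be sketched as minimizing the least absolute deviation (LAD) loss plus an adaptively weighted L1 penalty on an over-specified AR model. The Gaussian innovations, the penalty level, and the simplex search (a crude stand-in for the linear-programming solvers usually applied to LAD-lasso problems) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulate a finite-variance AR(2) series; the method also covers innovations
# in the domain of attraction of a stable law, for which Gaussian noise is
# only a stand-in here.
n = 400
true = np.array([0.5, -0.3])
y = np.zeros(n)
for t in range(2, n):
    y[t] = true[0] * y[t - 1] + true[1] * y[t - 2] + rng.standard_normal()

p = 3  # deliberately over-specified AR order
X = np.column_stack([y[p - j - 1:n - j - 1] for j in range(p)])
z = y[p:]

# Adaptive-lasso weights from a pilot least-squares fit: small pilot
# coefficients receive large penalties.
pilot, *_ = np.linalg.lstsq(X, z, rcond=None)
w = 1.0 / (np.abs(pilot) + 1e-8)

# LAD loss + weighted L1 penalty, minimized by a general-purpose
# derivative-free search.
lam = 2.0
objective = lambda b: np.sum(np.abs(z - X @ b)) + lam * np.sum(w * np.abs(b))
res = minimize(objective, pilot, method="Nelder-Mead")
```

The adaptive weights are what allow one penalty level to shrink the spurious third lag hard while leaving the true first two lags nearly unpenalized.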
While automatic smoothing parameter selection for nonparametric function estimation has been extensively researched for independent data, far less is known for clustered and longitudinal data. Although leave-subject-out cross-validation (CV) has been widely used, its theoretical properties are unknown and its minimization is computationally expensive, especially when there are multiple smoothing parameters. By focusing on penalized modeling methods, we show that leave-subject-out CV is optimal in that its minimization is asymptotically equivalent to the minimization of the true loss function. We develop an efficient Newton-type algorithm to compute the smoothing parameters that minimize the CV criterion. Furthermore, we derive a simplification of the leave-subject-out CV that leads to a more efficient algorithm for selecting the smoothing parameters. We show that the simplified CV criterion is asymptotically equivalent to the unsimplified one and thus enjoys the same optimality property. This CV criterion also provides a completely data-driven approach to selecting the working covariance structure when using generalized estimating equations in longitudinal data analysis. Our results are applicable to additive, linear varying-coefficient, and nonlinear models with data from exponential families.
Bias-correction and Test for Mark-point Dependence with Replicated Marked Point Processes
Mark-point dependence plays a critical role in research problems that can be
fitted into the general framework of marked point processes. In this work, we
focus on adjusting for mark-point dependence when estimating the mean and
covariance functions of the mark process, given independent replicates of the
marked point process. We assume that the mark process is a Gaussian process and
the point process is a log-Gaussian Cox process, where the mark-point
dependence is generated through the dependence between two latent Gaussian
processes. Under this framework, naive local linear estimators ignoring the
mark-point dependence can be severely biased. We show that this bias can be
corrected using a local linear estimator of the cross-covariance function and
establish uniform convergence rates of the bias-corrected estimators.
Furthermore, we propose a test statistic based on local linear estimators for
mark-point independence, which is shown to converge to an asymptotic normal
distribution at a parametric √n convergence rate. Model diagnostic tools
are developed for key model assumptions, and a robust functional
permutation test is proposed for a more general class of mark-point processes.
The effectiveness of the proposed methods is demonstrated using extensive
simulations and applications to two real data examples.
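The naive estimator that the work shows to be biased under mark-point dependence is an ordinary local linear smoother of the marks. A minimal sketch, assuming a Gaussian kernel and marks observed along a one-dimensional domain (the bias-correction via the cross-covariance function is not reproduced here):

```python
import numpy as np

def local_linear(t_eval, t_obs, m_obs, h):
    """Naive local linear smoother of marks m_obs observed at points t_obs,
    using a Gaussian kernel with bandwidth h."""
    out = np.empty(len(t_eval))
    for i, t0 in enumerate(t_eval):
        w = np.exp(-0.5 * ((t_obs - t0) / h) ** 2)   # kernel weights
        Xd = np.column_stack([np.ones_like(t_obs), t_obs - t0])
        XtW = Xd.T * w                               # weighted design
        beta = np.linalg.solve(XtW @ Xd, XtW @ m_obs)
        out[i] = beta[0]  # local intercept = fitted mean at t0
    return out

rng = np.random.default_rng(0)
t_obs = rng.uniform(0, 1, 300)                       # observation points
m_obs = np.sin(2 * np.pi * t_obs) + 0.1 * rng.standard_normal(300)  # marks
grid = np.linspace(0.1, 0.9, 5)
fit = local_linear(grid, t_obs, m_obs, h=0.08)
```

With points sampled independently of the marks, as here, the smoother is nearly unbiased; the bias the paper corrects arises precisely when the point locations and the marks share a latent Gaussian process.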
Group Network Hawkes Process
In this work, we study the event occurrences of individuals interacting in a
network. To characterize the dynamic interactions among the individuals, we
propose a group network Hawkes process (GNHP) model whose network structure is
observed and fixed. In particular, we introduce a latent group structure among
individuals to account for the heterogeneous user-specific characteristics. A
maximum likelihood approach is proposed to simultaneously cluster individuals
in the network and estimate model parameters. A fast EM algorithm is
subsequently developed by utilizing the branching representation of the
proposed GNHP model. Theoretical properties of the resulting estimators of
group memberships and model parameters are investigated under both settings
when the number of latent groups is over-specified or correctly specified.
A data-driven criterion that can consistently identify the true number of groups under mild
conditions is derived. Extensive simulation studies and an application to a
data set collected from Sina Weibo are used to illustrate the effectiveness of
the proposed methodology.
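The conditional intensity of such a model can be sketched with exponential excitation kernels: a node's intensity is a group-level baseline plus excitation from past events of its network neighbors, scaled by group-pair parameters. The specific kernel, the adjacency weighting, and all parameter names below are illustrative assumptions, not the paper's specification; the EM estimation step is omitted.

```python
import numpy as np

def gnhp_intensity(t, events, group, mu, A, W, beta):
    """Conditional intensity of every node at time t under a group network
    Hawkes sketch with exponential kernels.

    events : list of (time, node) pairs observed so far
    group  : group label of each node
    mu[g]  : baseline rate of group g
    A[g,h] : excitation strength from group h onto group g
    W[i,j] : observed (fixed) network adjacency
    beta   : exponential decay rate of the excitation kernel
    """
    n = len(group)
    lam = np.array([mu[group[i]] for i in range(n)], dtype=float)
    for s, j in events:
        if s < t:
            decay = beta * np.exp(-beta * (t - s))
            for i in range(n):
                lam[i] += W[i, j] * A[group[i], group[j]] * decay
    return lam

# Toy example: 3 nodes, 2 latent groups, one observed event at node 0.
group = [0, 0, 1]
mu = np.array([0.2, 0.5])
A = np.array([[0.3, 0.1],
              [0.2, 0.4]])
W = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
events = [(1.0, 0)]
lam = gnhp_intensity(2.0, events, group, mu, A, W, 1.5)
```

The branching representation the paper exploits corresponds to attributing each event either to the baseline term or to exactly one past event's excitation term in this sum.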