An Algorithmic Framework for Computing Validation Performance Bounds by Using Suboptimal Models
Practical model building processes are often time-consuming because many
different models must be trained and validated. In this paper, we introduce a
novel algorithm that can be used for computing the lower and the upper bounds
of model validation errors without actually training the model itself. A key
idea behind our algorithm is using side information available from a
suboptimal model. If a reasonably good suboptimal model is available, our
algorithm can compute lower and upper bounds of many useful quantities for
making inferences on the unknown target model. We demonstrate the advantage of
our algorithm in the context of model selection for regularized learning
problems.
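To illustrate how a suboptimal model can bound quantities of the unknown optimal model, here is a minimal sketch based on a standard strong-convexity argument. It is not necessarily the construction used in the paper. The symbols $P_\lambda$, $\ell$, $\tilde{w}$, $w^\ast$, $r$, and $x'$ are assumptions introduced for this illustration, with $\ell$ taken to be convex and differentiable.

```latex
% Assumed L2-regularized objective (lambda-strongly convex):
P_\lambda(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(y_i, x_i^\top w\right) + \frac{\lambda}{2}\,\|w\|_2^2 .

% Strong convexity and \nabla P_\lambda(w^\ast) = 0 give, for any suboptimal \tilde{w},
\lambda\,\|\tilde{w} - w^\ast\|_2^2
  \le \nabla P_\lambda(\tilde{w})^\top (\tilde{w} - w^\ast)
  \le \|\nabla P_\lambda(\tilde{w})\|_2\,\|\tilde{w} - w^\ast\|_2 ,
\quad\text{hence}\quad
\|\tilde{w} - w^\ast\|_2 \le r := \frac{\|\nabla P_\lambda(\tilde{w})\|_2}{\lambda} .

% Consequently, the optimal model's score on a validation input x' is bracketed by
x'^\top \tilde{w} - r\,\|x'\|_2 \;\le\; x'^\top w^\ast \;\le\; x'^\top \tilde{w} + r\,\|x'\|_2 .
```

When both ends of the bracket have the same sign, the optimal model's predicted label on $x'$ is already determined. Counting such points gives lower and upper bounds on the validation error without solving for $w^\ast$.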
Consistency and convergence rate of phylogenetic inference via regularization
It is common in phylogenetics to have some, perhaps partial, information
about the overall evolutionary tree of a group of organisms and wish to find an
evolutionary tree of a specific gene for those organisms. There may not be
enough information in the gene sequences alone to accurately reconstruct the
correct "gene tree." Although the gene tree may deviate from the "species tree"
due to a variety of genetic processes, in the absence of evidence to the
contrary it is parsimonious to assume that they agree. A common statistical
approach in these situations is to develop a likelihood penalty to incorporate
such additional information. Recent studies using simulation and empirical data
suggest that a likelihood penalty quantifying concordance with a species tree
can significantly improve the accuracy of gene tree reconstruction compared to
using sequence data alone. However, the consistency of such an approach has not
yet been established, nor have convergence rates been bounded. Because
phylogenetics is a non-standard inference problem, the standard theory does not
apply. In this paper, we propose a penalized maximum likelihood estimator for
gene tree reconstruction, where the penalty is the square of the
Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species
tree. We prove that this method is consistent, and derive its convergence rate
for estimating the discrete gene tree structure and continuous edge lengths
(representing the amount of evolution that has occurred on that branch)
simultaneously. We find that the regularized estimator is "adaptive fast
converging," meaning that it can reconstruct all edges of length greater than
any given threshold from gene sequences of polynomial length. Our method does
not require the species tree to be known exactly; in fact, our asymptotic
theory holds for any such guide tree.
Comment: 34 pages, 5 figures. To appear in The Annals of Statistics
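The penalized estimator described in this abstract can be written schematically as follows. The notation is an assumption made for illustration: $\ell_n$ is the log-likelihood of the $n$ observed sequence sites, $T_0$ the guide (species) tree, $\lambda_n \ge 0$ a penalty weight, and $d_{\mathrm{BHV}}$ the Billera-Holmes-Vogtmann geodesic distance. The paper's exact scaling of the two terms may differ.

```latex
\hat{T}_n \in \arg\max_{T}\;
  \Big\{\, \ell_n(T) \;-\; \lambda_n\, d_{\mathrm{BHV}}(T, T_0)^2 \,\Big\}
```

Here the maximization runs jointly over tree topologies and edge lengths, which matches the abstract's statement that the discrete structure and the continuous edge lengths are estimated simultaneously.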
Early stopping and non-parametric regression: An optimal data-dependent stopping rule
The strategy of early stopping is a regularization technique based on
choosing a stopping time for an iterative algorithm. Focusing on non-parametric
regression in a reproducing kernel Hilbert space, we analyze the early stopping
strategy for a form of gradient-descent applied to the least-squares loss
function. We propose a data-dependent stopping rule that does not involve
hold-out or cross-validation data, and we prove upper bounds on the squared
error of the resulting function estimate, measured in either the $L^2(P)$ or
$L^2(P_n)$ norm. These upper bounds lead to minimax-optimal rates for various
kernel classes, including Sobolev smoothness classes and other forms of
reproducing kernel Hilbert spaces. We show through simulation that our stopping
rule compares favorably to two other stopping rules, one based on hold-out data
and the other based on Stein's unbiased risk estimate. We also establish a
tight connection between our early stopping strategy and the solution path of a
kernel ridge regression estimator.
Comment: 29 pages, 4 figures
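To make the setting concrete, the sketch below runs gradient descent on the least-squares loss over the fitted values at the training points, using a kernel matrix. The number of iterations acts as the regularization parameter. It does not implement the paper's data-dependent stopping rule. The Gaussian kernel, the bandwidth, the step size, the synthetic data, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kernel(x, z, bandwidth=0.5):
    """Gaussian kernel matrix between 1-D point sets x and z (assumed kernel)."""
    diff = x[:, None] - z[None, :]
    return np.exp(-diff ** 2 / (2.0 * bandwidth ** 2))


def kernel_gradient_descent(K, y, num_steps, step_size=None):
    """Iterate f <- f - step * (K/n) (f - y), starting from f = 0.

    Returns an array of shape (num_steps + 1, n) holding the fitted values
    at the training points after each iteration.
    """
    n = len(y)
    Kn = K / n
    if step_size is None:
        # Step size 1 / lambda_max(K/n) keeps the iteration stable.
        step_size = 1.0 / np.linalg.eigvalsh(Kn).max()
    f = np.zeros(n)
    path = [f.copy()]
    for _ in range(num_steps):
        f = f - step_size * (Kn @ (f - y))
        path.append(f.copy())
    return np.array(path)


# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0.0, 1.0, n))
truth = np.sin(2.0 * np.pi * x)
y = truth + 0.3 * rng.standard_normal(n)

K = gaussian_kernel(x, x)
path = kernel_gradient_descent(K, y, num_steps=200)

# Training error never increases with more iterations; the error against
# the noiseless truth is the quantity a stopping rule tries to keep small.
train_err = np.mean((path - y) ** 2, axis=1)
true_err = np.mean((path - truth) ** 2, axis=1)
print("iteration with smallest error vs. truth:", int(np.argmin(true_err)))
```

Because the true regression function is unknown in practice, `true_err` cannot be computed from the data. A stopping rule replaces it with something computable, such as hold-out error or a data-dependent threshold of the kind the abstract proposes.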