Fast Cross-Validation via Sequential Testing
With the increasing size of today's data sets, finding the right parameter
configuration in model selection via cross-validation can be an extremely
time-consuming task. In this paper we propose an improved cross-validation
procedure which uses nonparametric testing coupled with sequential analysis to
determine the best parameter set on linearly increasing subsets of the data. By
eliminating underperforming candidates quickly and keeping promising candidates
as long as possible, the method speeds up the computation while preserving the
capability of the full cross-validation. Theoretical considerations underline
the statistical power of our procedure. The experimental evaluation shows that
our method reduces the computation time by a factor of up to 120 compared to a
full cross-validation with a negligible impact on the accuracy.
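The elimination idea above can be sketched in a few lines. This is a simplified illustration, not the paper's exact procedure: the real method uses a nonparametric sequential test, whereas here a candidate is dropped after a fixed number of consecutive losses against the step's best candidate; the function names and the `drop_after` parameter are inventions for this sketch.

```python
import numpy as np

def sequential_cv(configs, evaluate, n_steps=10, drop_after=3):
    """Illustrative sketch only: evaluate surviving parameter
    configurations on linearly growing data subsets and drop any
    configuration that has lost to the step's best candidate for
    `drop_after` consecutive steps (a stand-in for the paper's
    nonparametric sequential test)."""
    survivors = set(configs)
    losses = {c: 0 for c in configs}
    for step in range(1, n_steps + 1):
        frac = step / n_steps                      # linearly increasing subset size
        errors = {c: evaluate(c, frac) for c in survivors}
        best = min(errors, key=errors.get)
        for c in list(survivors):
            # reset the loss counter when a candidate ties the best
            tied = c == best or np.isclose(errors[c], errors[best])
            losses[c] = 0 if tied else losses[c] + 1
            if losses[c] >= drop_after and len(survivors) > 1:
                survivors.remove(c)                # eliminate underperformer early
    # final choice among survivors on the full data
    return min(survivors, key=lambda c: evaluate(c, 1.0))
```

Because clearly losing candidates stop consuming evaluations after a few small subsets, most of the compute budget is spent on the promising ones, which is the source of the reported speed-up.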
Fast cross-validation of kernel Fisher discriminant classifiers
Given n training examples, the training of a Kernel Fisher Discriminant (KFD) classifier corresponds to solving a linear system of dimension n. In cross-validating KFD, the training examples are split into two distinct subsets a number of times (L), wherein a subset of m examples is used for validation and the other subset of (n - m) examples is used for training the classifier. In this case, L linear systems of dimension (n - m) need to be solved. We propose a novel method for cross-validation of KFD in which, instead of solving L linear systems of dimension (n - m), we compute the inverse of an n × n matrix and solve L linear systems of dimension 2m, thereby reducing the complexity when L is large and/or m is small. For typical 10-fold and leave-one-out cross-validations, the proposed algorithm is approximately 4 and (4/9n) times, respectively, as efficient as the naive implementations. Simulations are provided to demonstrate the efficiency of the proposed algorithms.
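The trick of trading L large solves for one n × n inverse plus small per-fold solves rests on a standard block-matrix inverse identity, sketched below. This is only an illustration of that identity, not the paper's KFD-specific derivation (which yields systems of dimension 2m from the structure of the KFD equations); the matrix and fold indices here are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 2                                   # n training points, m held out per fold
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                   # symmetric positive definite "kernel" matrix
Kinv = np.linalg.inv(K)                       # one n x n inverse, computed once

S = np.array([0, 1])                          # indices of the held-out fold
T = np.setdiff1d(np.arange(n), S)             # remaining (n - m) training indices

# Block-inverse identity:
#   inv(K[T,T]) = Kinv[T,T] - Kinv[T,S] @ inv(Kinv[S,S]) @ Kinv[S,T]
# so each fold needs only an m x m solve instead of an (n-m) x (n-m) one.
fold_inv = Kinv[np.ix_(T, T)] - Kinv[np.ix_(T, S)] @ np.linalg.solve(
    Kinv[np.ix_(S, S)], Kinv[np.ix_(S, T)])

direct_inv = np.linalg.inv(K[np.ix_(T, T)])   # naive per-fold inverse for comparison
```

When L is large and m is small, repeating the cheap m × m correction per fold amortizes the single up-front n × n inverse, which is the complexity trade-off the abstract describes.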
Fast cross-validation for multi-penalty ridge regression
High-dimensional prediction with multiple data types needs to account for
potentially strong differences in predictive signal. Ridge regression is a
simple model for high-dimensional data that has challenged the predictive
performance of many more complex models and learners, and that allows inclusion
of data type specific penalties. The largest challenge for multi-penalty ridge
is to optimize these penalties efficiently in a cross-validation (CV) setting,
in particular for GLM and Cox ridge regression, which require an additional
estimation loop by iterative weighted least squares (IWLS). Our main
contribution is a computationally very efficient formula for the multi-penalty,
sample-weighted hat-matrix, as used in the IWLS algorithm. As a result, nearly
all computations are in low-dimensional space, rendering a speed-up of several
orders of magnitude. We developed a flexible framework that facilitates
multiple types of response, unpenalized covariates, several performance
criteria and repeated CV. Extensions to paired and preferential data types are
included and illustrated on several cancer genomics survival prediction
problems. Moreover, we present similar computational shortcuts for maximum
marginal likelihood and Bayesian probit regression. The corresponding
R-package, multiridge, serves as a versatile standalone tool, but also as a
fast benchmark for other more complex models and multi-view learners.
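The "nearly all computations in low-dimensional space" claim builds on a well-known ridge identity: the n × n hat matrix can be formed from the sample-space Gram matrix instead of the p × p feature-space system. The sketch below shows the single-penalty, unweighted case only; multiridge's contribution is the extension to multiple penalties and IWLS sample weights, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 20, 500, 2.0                      # few samples, many features
X = rng.standard_normal((n, p))

# Feature-space hat matrix: requires a p x p solve (costly when p >> n)
H_highdim = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

# Sample-space identity: H = XX^T (XX^T + lam I)^{-1}, only an n x n solve
G = X @ X.T                                   # n x n Gram matrix
H_lowdim = np.linalg.solve(G + lam * np.eye(n), G).T

print(H_highdim.shape)                        # (20, 20) either way
```

For n in the tens and p in the tens of thousands, replacing the p × p solve with an n × n solve inside every IWLS iteration and CV fold is what yields the several-orders-of-magnitude speed-up reported above.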