No unbiased Estimator of the Variance of K-Fold Cross-Validation
In statistical machine learning, the standard measure of accuracy for models is the prediction error, i.e. the expected loss on future examples. When the data distribution is unknown, it cannot be computed, but several resampling methods, such as K-fold cross-validation, can be used to obtain an unbiased estimator of prediction error. However, to compare learning algorithms one also needs to estimate the uncertainty around the cross-validation estimator, which is important because it can be very large. The usual variance estimates for means of independent samples cannot be used, because the data are reused in forming the cross-validation estimator. The main result of this paper is that there is no universal (distribution-independent) unbiased estimator of the variance of the K-fold cross-validation estimator based only on the empirical results of the error measurements obtained through the cross-validation procedure. The analysis provides a theoretical understanding of why this estimation is difficult. These results generalize to other resampling methods, as long as data are reused for training or testing.
Keywords: prediction error, cross-validation, multivariate variance estimators, statistical comparison of algorithms
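To make the estimation problem in the abstract above concrete, the following is a minimal sketch, not taken from the paper: it runs K-fold cross-validation for a toy mean predictor under squared loss, then computes the naive variance estimate that treats the per-example errors as independent. The function names, the toy predictor, and the simulated Gaussian data are all assumptions made for illustration. The naive estimate is the kind the abstract says cannot be relied on, because training sets overlap and the errors are correlated.

```python
import random
import statistics


def kfold_errors(y, k, seed=0):
    """Per-example squared errors from K-fold CV of a toy mean predictor."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test blocks
    errors = []
    for test in folds:
        held_out = set(test)
        train = [y[i] for i in idx if i not in held_out]
        pred = statistics.fmean(train)  # hypothetical "learning algorithm"
        errors.extend((y[i] - pred) ** 2 for i in test)
    return errors


def cv_estimate_and_naive_variance(y, k):
    """Return the CV error estimate and a naive variance estimate.

    The naive estimate divides the sample variance of the errors by n,
    as if the errors were independent. They are not: every training set
    shares most of its examples with the others.
    """
    errors = kfold_errors(y, k)
    cv = statistics.fmean(errors)
    naive_var = statistics.variance(errors) / len(errors)
    return cv, naive_var


# Simulated data, assumed purely for illustration.
rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(50)]
cv, naive_var = cv_estimate_and_naive_variance(y, k=5)
print(f"CV estimate: {cv:.3f}, naive variance: {naive_var:.5f}")
```

Repeating the procedure over many independently drawn datasets and comparing the empirical variance of `cv` with the average `naive_var` is one way to observe the gap between the two.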
Inferring Multiple Graphical Structures
Gaussian Graphical Models provide a convenient framework for representing
dependencies between variables. Recently, this tool has attracted considerable
interest for the discovery of biological networks. The literature focuses on
the case where a single network is inferred from a set of measurements, but, as
wet-lab data are typically scarce, several assays, where the experimental
conditions affect interactions, are usually merged to infer a single network.
In this paper, we propose two approaches for estimating multiple related
graphs, by rendering the closeness assumption into an empirical prior or group
penalties. We provide quantitative results demonstrating the benefits of the
proposed approaches. The methods presented in this paper are embedded in the R
package 'simone' from version 1.0-0 onward.
Sparsity with sign-coherent groups of variables via the cooperative-Lasso
We consider the problems of estimation and selection of parameters endowed
with a known group structure, when the groups are assumed to be sign-coherent,
that is, gathering either nonnegative, nonpositive or null parameters. To
tackle this problem, we propose the cooperative-Lasso penalty. We derive the
optimality conditions defining the cooperative-Lasso estimate for generalized
linear models, and propose an efficient active set algorithm suited to
high-dimensional problems. We study the asymptotic consistency of the estimator
in the linear regression setup and derive its irrepresentable conditions, which
are milder than the ones of the group-Lasso regarding the matching of groups
with the sparsity pattern of the true parameters. We also address the problem
of model selection in linear regression by deriving an approximation of the
degrees of freedom of the cooperative-Lasso estimator. Simulations comparing
the proposed estimator to the group and sparse group-Lasso agree with our
theoretical results, showing consistent improvements in support recovery for
sign-coherent groups. We finally propose two examples illustrating the wide
applicability of the cooperative-Lasso: first to the processing of ordinal
variables, where the penalty acts as a monotonicity prior; second to the
processing of genomic data, where the set of differentially expressed probes is
enriched by incorporating all the probes of the microarray that are related to
the corresponding genes.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS520 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
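The abstract above does not state the penalty's formula. As a hedged illustration, the sketch below assumes that the cooperative-Lasso penalty sums, over groups, the Euclidean norm of the positive parts and the Euclidean norm of the negative parts of each group's coefficients. That form is an assumption here, and the function names are hypothetical. It is set against the ordinary group-Lasso norm to show why sign-coherent groups are favored.

```python
import math


def group_norm(group):
    """Ordinary group-Lasso norm: Euclidean norm of the whole group."""
    return math.sqrt(sum(b * b for b in group))


def coop_norm(group):
    """Assumed cooperative norm: norm of positive parts plus norm of negative parts."""
    pos = math.sqrt(sum(max(b, 0.0) ** 2 for b in group))
    neg = math.sqrt(sum(min(b, 0.0) ** 2 for b in group))
    return pos + neg


def coop_penalty(groups):
    """Sum of cooperative norms over all groups (unit group weights assumed)."""
    return sum(coop_norm(g) for g in groups)


# A sign-coherent group is penalized exactly like the group-Lasso...
print(group_norm([3.0, 4.0]), coop_norm([3.0, 4.0]))
# ...while a group with mixed signs pays more under the cooperative norm.
print(group_norm([3.0, -4.0]), coop_norm([3.0, -4.0]))
```

Under this assumed form, flipping the sign of one coefficient in a group leaves the group-Lasso norm unchanged but raises the cooperative norm, which is what pushes the estimate toward sign-coherent groups.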
Beyond Support in Two-Stage Variable Selection
Numerous variable selection methods rely on a two-stage procedure, where a
sparsity-inducing penalty is used in the first stage to predict the support,
which is then conveyed to the second stage for estimation or inference
purposes. In this framework, the first stage screens variables to find a set of
possibly relevant variables and the second stage operates on this set of
candidate variables, to improve estimation accuracy or to assess the
uncertainty associated with the selection of variables. We argue that more
information can be conveyed from the first stage to the second one: we use the
magnitude of the coefficients estimated in the first stage to define an
adaptive penalty that is applied at the second stage. We give two examples of
procedures that can benefit from the proposed transfer of information, in
estimation and inference problems respectively. Extensive simulations
demonstrate that this transfer is particularly efficient when each stage
operates on distinct subsamples. This separation plays a crucial role for the
computation of calibrated p-values, which makes it possible to control the False Discovery
Rate. In this setup, the proposed transfer results in sensitivity gains ranging
from 50% to 100% compared to state-of-the-art
Learning from Partial Labels with Minimum Entropy
This paper introduces the minimum entropy regularizer for learning from partial labels. This learning problem encompasses the semi-supervised setting, where a decision rule is to be learned from labeled and unlabeled examples. The minimum entropy regularizer applies to diagnosis models, i.e. models of the posterior probabilities of classes. It is shown to include other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed criterion provides solutions that take advantage of unlabeled examples when the latter convey information. Even when the data are sampled from the distribution class spanned by a generative model, the proposed approach improves over the estimated generative model when the number of features is of the order of the sample size. Performance clearly favors minimum entropy when the generative model is slightly misspecified. Finally, the robustness of the learning scheme is demonstrated: in situations where unlabeled examples do not convey information, minimum entropy returns a solution that discards unlabeled examples and performs as well as supervised learning.
Keywords: discriminant learning, semi-supervised learning, minimum entropy
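As a rough illustration of the criterion described above, the sketch below combines the negative log-likelihood of the labeled examples with an entropy term on the predicted class posteriors of the unlabeled examples. It covers only the semi-supervised special case; the function names, the trade-off parameter `lam`, and the toy posteriors are assumptions for illustration, not the paper's notation.

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution, with 0 * log 0 taken as 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def min_entropy_criterion(labeled, unlabeled, lam):
    """Criterion to minimize: labeled NLL plus lam times unlabeled entropy.

    labeled:   list of (posterior, label_index) pairs
    unlabeled: list of posteriors (one probability vector per example)
    lam:       hypothetical weight on the entropy regularizer
    """
    nll = -sum(math.log(p[y]) for p, y in labeled)
    ent = sum(entropy(p) for p in unlabeled)
    return nll + lam * ent


# Toy posteriors, assumed for illustration.
labeled = [([0.9, 0.1], 0)]
confident = [[0.99, 0.01]]  # unlabeled point far from the decision boundary
unsure = [[0.5, 0.5]]       # unlabeled point on the decision boundary

print(min_entropy_criterion(labeled, confident, lam=1.0))
print(min_entropy_criterion(labeled, unsure, lam=1.0))
```

With `lam = 0` the criterion reduces to ordinary supervised maximum likelihood on the labeled examples, so the unlabeled examples are ignored.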
Composite kernel learning
The Support Vector Machine (SVM) is widely acknowledged as a powerful tool for building classifiers, but it lacks flexibility, in the sense that the kernel is chosen prior to learning. Multiple Kernel Learning (MKL) makes it possible to learn the kernel from an ensemble of basis kernels, whose combination is optimized in the learning process. Here, we propose Composite Kernel Learning to address the situation where distinct components give rise to a group structure among kernels. Our formulation of the learning problem encompasses several setups, putting more or less emphasis on the group structure. We characterize the convexity of the learning problem, and provide a general wrapper algorithm for computing solutions. Finally, we illustrate the behavior of our method on multi-channel data where groups correspond to channels.