202 research outputs found

    Analyse canonique généralisée régularisée et approche PLS

    No full text
    In this communication we define generalized canonical correlation analysis at the population level (population GCCA), which provides the theoretical framework for the PLS approach proposed by Herman Wold and for its extensions proposed by Jan-Bernd Lohmöller and Nicole Krämer. By writing the stationary equations of population GCCA at the sample level and using regularized (shrinkage) estimates of the block covariance matrices, we obtain new stationary equations at the sample level. These are also the stationary equations of an optimization problem that we call regularized generalized canonical correlation analysis (RGCCA). Searching for a fixed point of these sample-level stationary equations yields an algorithm very similar to the Wold-Lohmöller-Krämer PLS approach. In addition, we prove the monotone convergence of the proposed algorithm. Keywords: multiblock data analysis, PLS approach, regularized generalized canonical correlation analysis

    A low variance consistent test of relative dependency

    We describe a novel non-parametric statistical hypothesis test of relative dependence between a source variable and two candidate target variables. Such a test enables us to determine whether one source variable is significantly more dependent on a first target variable or a second. Dependence is measured via the Hilbert-Schmidt Independence Criterion (HSIC), resulting in a pair of empirical dependence measures (source-target 1, source-target 2). We test whether the first dependence measure is significantly larger than the second. Modeling the covariance between these HSIC statistics leads to a provably more powerful test than the construction of independent HSIC statistics by sub-sampling. The resulting test is consistent and unbiased, and (being based on U-statistics) has favorable convergence properties. The test can be computed in quadratic time, matching the computational complexity of standard empirical HSIC estimators. The effectiveness of the test is demonstrated on several real-world problems: we identify language groups from a multilingual corpus, and we show that tumor location is more dependent on gene expression than on chromosomal imbalances. Source code is available for download at https://github.com/wbounliphone/reldep. Comment: International Conference on Machine Learning, Jul 2015, Lille, France
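
As a rough illustration of the quantities this abstract compares, the sketch below computes a biased empirical HSIC in quadratic time for scalar samples and evaluates it for one source against two candidate targets. It is a minimal sketch under stated assumptions, not the test from the paper: the Gaussian kernel, the bandwidth `sigma`, the 1/n^2 normalization, and the names `gaussian_gram`, `center`, and `hsic_biased` are all illustrative choices. The paper's test is based on U-statistics and models the covariance between the two HSIC statistics, which this sketch does not do.

```python
import math


def gaussian_gram(values, sigma=1.0):
    """Gram matrix of a Gaussian kernel over a list of scalars."""
    return [
        [math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
        for a in values
    ]


def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 11^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic_biased(xs, ys, sigma=1.0):
    """Biased empirical HSIC, trace(K H L H) / n^2, computed in O(n^2) time."""
    n = len(xs)
    k_centered = center(gaussian_gram(xs, sigma))
    l_gram = gaussian_gram(ys, sigma)
    return sum(
        k_centered[i][j] * l_gram[i][j] for i in range(n) for j in range(n)
    ) / n ** 2


# Toy data (assumed for illustration only).
source = [i / 5.0 for i in range(-10, 11)]
target_1 = [2.0 * x for x in source]  # deterministic function of the source
target_2 = [3.0 for _ in source]      # constant, carries no information

hsic_1 = hsic_biased(source, target_1)
hsic_2 = hsic_biased(source, target_2)
print("HSIC(source, target 1) =", hsic_1)
print("HSIC(source, target 2) =", hsic_2)
```

Comparing the two numbers directly only ranks the dependence measures; deciding whether the difference is significant requires the joint distribution of the two statistics, which is what the test described in the abstract provides.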

    SHrinkage Covariance Estimation Incorporating Prior Biological Knowledge with Applications to High-Dimensional Data

    In "-omic data" analysis, information on the structure of covariates is broadly available either from public databases describing gene regulation processes and functional groups, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), or from statistical analyses, for example in the form of partial correlation estimators. The analysis of transcriptomic data might benefit from the incorporation of such prior knowledge. In this paper we focus on the integration of structured information into statistical analyses in which at least one major step involves the estimation of a (high-dimensional) covariance matrix. More precisely, we revisit the recently proposed "SHrinkage Incorporating Prior" (SHIP) covariance estimation method, which takes into account the group structure of the covariates, and suggest integrating the SHIP covariance estimator into various multivariate methods such as linear discriminant analysis (LDA), global analysis of covariance (GlobalANCOVA), and regularized generalized canonical correlation analysis (RGCCA). We demonstrate the use of the resulting new methods based on simulations and discuss the benefit of the integration of prior information through the SHIP estimator. Reproducible R code is available at http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/shipproject/index.html
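
The general idea of shrinking a sample covariance matrix toward a target that encodes group structure can be sketched as follows. This is a hedged illustration only: the target used here (keep within-group entries, set between-group entries to zero), the fixed shrinkage intensity `lam`, the toy data, and the names `sample_covariance`, `group_target`, and `shrink` are assumptions made for the example, and they are not claimed to match the SHIP estimator or its choice of shrinkage intensity.

```python
def sample_covariance(rows):
    """Unbiased sample covariance matrix of an n x p data set (list of rows)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


def group_target(cov, groups):
    """Illustrative target: keep within-group entries, zero out the rest."""
    p = len(cov)
    return [
        [cov[i][j] if groups[i] == groups[j] else 0.0 for j in range(p)]
        for i in range(p)
    ]


def shrink(cov, target, lam):
    """Convex combination lam * target + (1 - lam) * cov, with 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    p = len(cov)
    return [
        [lam * target[i][j] + (1.0 - lam) * cov[i][j] for j in range(p)]
        for i in range(p)
    ]


# Toy data: 4 observations of 3 variables (assumed for illustration only).
data = [
    [1.0, 2.0, 0.5],
    [2.0, 2.9, 1.5],
    [3.0, 4.2, 0.7],
    [4.0, 4.8, 1.9],
]
groups = [0, 0, 1]  # hypothetical grouping: variables 0 and 1 share a group

cov = sample_covariance(data)
shrunk = shrink(cov, group_target(cov, groups), lam=0.5)
print(shrunk)
```

With this target, entries for variables in the same group are left unchanged and between-group entries are scaled by `1 - lam`, so prior knowledge about groups decides which covariances are trusted.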

    Analyse Factorielle Discriminante Multi-voie

    No full text
    Discriminant factor analysis is extended to multiway data, that is, data for which several modalities have been observed for each variable. Multiway data are thus structured as a tensor. The proposed extension relies on a model of the discriminant axes that takes the tensor structure of the data into account. Compared with methods that build a classifier from the matrix obtained by unfolding the tensor, the expected gains are better interpretability and greater robustness to overfitting, a phenomenon that becomes more pronounced in the multiway setting as the number of modalities grows. An alternating directions algorithm yields the discriminant axes. Results on simulated data confirm these gains.

    Over-optimism in bioinformatics: an illustration

    In statistical bioinformatics research, different optimization mechanisms potentially lead to "over-optimism" in published papers. The present empirical study illustrates these mechanisms through a concrete example from an active research field. The investigated sources of over-optimism include the optimization of the data sets, of the settings, of the competing methods and, most importantly, of the method's characteristics. We consider a "promising" new classification algorithm that turns out to yield disappointing results in terms of error rate, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. We quantitatively demonstrate that this disappointing method can artificially seem superior to existing approaches if we "fish for significance". We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should be validated using "fresh" validation data sets.

    Regularized Generalized Canonical Correlation Analysis and PLS Path Modeling

    No full text
    Regularized Generalized Canonical Correlation Analysis (RGCCA) and Partial Least Squares Path Modeling (PLS-PM) have been proposed for studying relationships between J sets of variables observed on the same set of individuals, taking into account a graph of connections between blocks. The main goal of this communication is to compare the various options of PLS-PM and RGCCA. Initial comparisons show that the two approaches behave very similarly.

    Multiway Regularized Generalized Canonical Correlation Analysis

    No full text
    Regularized Generalized Canonical Correlation Analysis (RGCCA) makes it possible to study the relationships between different blocks of data. In this paper, a multiway version of RGCCA (MGCCA) is proposed. MGCCA seeks to describe and understand the relationships between tensors.

    Variable Selection in Partial Least Squares Methods: overview and recent developments

    Recent developments in technology make it possible to collect large amounts of data from various sources. Moreover, many real-world applications require studying relations among several groups of variables. The analysis of landscape matrices, i.e. matrices having more columns (variables, p) than rows (observations, n), is a challenging task in several domains. Two different kinds of problems arise when dealing with high-dimensional data sets characterized by landscape matrices. The first concerns computational and numerical problems. The second concerns the difficulty of assessing and understanding the results. Dimension reduction appears to address both problems. We should distinguish between feature selection and feature extraction: feature selection refers to variable selection, while feature extraction aims to transform the data from a high-dimensional space to a low-dimensional space. Partial Least Squares (PLS) methods are classical feature extraction tools that work in the case of high-dimensional data sets. Since PLS methods do not require matrix inversion or diagonalization, they allow us to solve the computational problems. However, interpreting the results is still a hard problem when faced with very high-dimensional data sets. Moreover, Chun & Keles (2010) recently showed that asymptotic consistency of the PLS regression estimator for the univariate case does not hold in the very large p and small n paradigm. There is now increasing interest in developing new PLS methods that are, at the same time, a feature extraction tool and a feature selection method. The first attempt to perform variable selection in the univariate PLS Regression framework was presented by Bastien et al. in 2005. More recently, Le Cao et al. (2008) and Chun & Keles (2010) proposed two different approaches to include variable selection in PLS Regression, based on L1 penalization (Tibshirani, 1996). In our work, we will investigate all these approaches and discuss their pros and cons. Moreover, a new version of the PLS Path Modeling algorithm including variable selection will be presented.

    Analyse différentielle de puces à ADN. Comparaison entre méthodes wrapper et filter.

    For gene expression data, we are interested in methods that identify genes that are significantly differentially expressed between two biological conditions. We compare a classical analysis based on hypothesis tests with differential analysis methods based on regularized regression. The difficulty with this kind of data set is the abundance of variables (the genes) relative to the small number of individuals (the expression profiles). The usual strategy is to run as many tests as there are variables and to regard the variables with the "best" p-values as the most important ones. An alternative strategy is to rank the variables not by their significance (for a test) but by their weight in the fitted regularized model. In the literature, methods of the first kind are called filter methods and those of the second kind wrapper methods. A good overview of wrapper and filter methods is given in [9]. The setting resembles supervised learning: expression profiles are available for, where possible, the whole genome of an organism, and each array belongs to a class, that is, a particular biological condition (for example diseased vs healthy). The methods discussed in this report were implemented in R [16].