26 research outputs found

    Persévérance et réussite scolaire par le forage de données d'éducation

    Get PDF
    This research was funded by the Ministère de l'Éducation et de l'Enseignement supérieur under the Programme d'aide à la recherche sur l'enseignement et l'apprentissage (PAREA). Includes bibliographical references.

    Improving Convergence for Nonconvex Composite Programming

    Full text link
    High-dimensional nonconvex problems are common in today's machine learning and statistical genetics research. Recently, Ghadimi and Lan \cite{Ghadimi} proposed an algorithm to optimize nonconvex high-dimensional problems. Several parameters in their algorithm must be set before it is run. Choosing these parameters is not trivial, and, to the best of our knowledge, there is no explicit rule for selecting them so that the algorithm converges faster. We analyze Ghadimi and Lan's algorithm to obtain an interpretation based on the inequality constraints for convergence and on the upper bound for the norm of the gradient analogue. Our interpretation suggests that their algorithm is a damped Nesterov acceleration scheme. Based on this, we propose an approach for selecting the parameters to improve the algorithm's convergence. Our numerical studies on high-dimensional nonconvex sparse learning problems with over 10,000 variables, motivated by image denoising and statistical genetics applications, show that when the parameters are chosen using our proposed approach, convergence is on average considerably faster than that of the conventional ISTA algorithm. Comment: 10 pages, 2 figures
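The ISTA baseline mentioned above is the standard proximal-gradient method for ℓ₁-penalized problems. As a point of reference, here is a minimal sketch of ISTA for the lasso objective — this is the generic textbook method, not the authors' accelerated algorithm, and all names are illustrative:

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, step=None, n_iter=200):
    """ISTA for the lasso problem min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    if step is None:
        # 1 / Lipschitz constant of the gradient of the smooth part
        step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)          # gradient of the smooth term
        x = soft_threshold(x - step * grad, step * lam)  # proximal step
    return x
```

Nesterov-type schemes such as Ghadimi and Lan's add a momentum (extrapolation) step between iterations; the abstract's contribution concerns how to set the parameters of that extrapolation.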

    Penalized regression methods for interaction and mixed-effects models with applications to genomic and brain imaging data

    No full text
    In high-dimensional (HD) data, where the number of covariates (p) greatly exceeds the number of observations (n), estimation can benefit from the bet-on-sparsity principle, i.e., only a small number of predictors are relevant to the response. This assumption can lead to more interpretable models, improved predictive accuracy, and algorithms that are computationally efficient. In genomic and brain imaging studies, where the sample sizes are particularly small due to high data collection costs, we must often assume a sparse model because there is not enough information to estimate p parameters. For these reasons, penalized regression methods such as the lasso and group lasso have generated substantial interest since they can set model coefficients exactly to zero. In the penalized regression framework, many approaches have been developed for main effects. However, there is a need to develop interaction and mixed-effects models. Indeed, accurately capturing interactions may hold the potential to better understand biological phenomena and improve prediction accuracy, since interactions may reflect important modulation of a biological system by an external factor. Furthermore, penalized mixed-effects models that account for correlations due to groupings of observations can improve sensitivity and specificity. This thesis is composed primarily of three manuscripts. In the first manuscript, we propose a method called sail for detecting non-linear interactions that automatically enforces the strong heredity property using both the ℓ₁ and ℓ₂ penalty functions. We describe a blockwise coordinate descent procedure for solving the objective function and provide performance metrics on both simulated and real data. The second manuscript develops a general penalized mixed-effects model framework, called ggmix, to account for correlations in genetic data due to relatedness.
Our method can accommodate several sparsity-inducing penalties such as the lasso, elastic net and group lasso and also readily handles prior annotation information in the form of weights. Our algorithm has theoretical guarantees of convergence and we again assess its performance in both simulated and real data. The third manuscript describes a novel strategy called eclust for dimension reduction that leverages the effects of an exposure variable with broad impact on HD measures. With eclust, we found improved prediction and variable selection performance compared to methods that do not consider the exposure in the clustering step, or to methods that use the original data as features. We further illustrate this modeling framework through the analysis of three data sets from very different fields, each with HD data, a binary exposure, and a phenotype of interest. We provide efficient implementations of all our algorithms in freely available and open source software.
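Penalties such as the group lasso mentioned above are typically handled through their proximal operator, which for the group lasso is blockwise soft-thresholding: each group's coefficient vector is shrunk toward zero as a block, and set exactly to zero when its norm falls below the threshold. The following is a generic sketch under that assumption, not code from the ggmix or sail packages:

```python
import numpy as np

def prox_group_lasso(beta, groups, t):
    """Blockwise soft-thresholding: the proximal operator of
    t * sum_g ||beta_g||_2, the group-lasso penalty.

    `groups` is a list of index lists partitioning the coefficients;
    `t` would be step_size * lambda inside a proximal-gradient loop.
    """
    out = beta.copy()
    for g in groups:
        norm = np.linalg.norm(beta[g])
        # Shrink the whole block; zero it out when its norm <= t.
        scale = max(0.0, 1.0 - t / norm) if norm > 0 else 0.0
        out[g] = scale * beta[g]
    return out
```

This is why group-penalized methods select or drop entire groups of coefficients (e.g., all dummy variables for one factor) at once, rather than individual entries.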

    A Model for Interpretable High Dimensional Interactions

    No full text
    Introduction: Since the introduction of the LASSO, computational approaches to variable selection have been rigorously developed in the statistical literature. The need for such methods has become increasingly important with the advent of high-throughput technologies in genomics and brain imaging studies, where it is believed that the number of truly important variables is small relative to the total number of variables. While the focus of these methods has been on additive models, there are several applications where interaction models can reflect biological phenomena and improve statistical power. For example, genome-wide association studies (GWAS) have been unable to explain a large proportion of heritability (the variance in phenotype attributable to genetic variants), and it has been suggested that this missing heritability may in part be due to gene-environment interactions. Furthermore, diseases are now thought to be the result of entire biological networks whose states are affected by environmental factors. These systemic changes can induce or eliminate strong correlations between elements in a network without necessarily affecting their mean levels.

    Methods: Therefore, we propose a multivariate penalization procedure for detecting interactions between high-dimensional data (p >> n) and an environmental factor, where the effect of this environmental factor on the high-dimensional data is widespread and plays a role in predicting the response. Our approach improves on existing procedures for detecting such interactions in several ways: 1) it simultaneously performs model selection and estimation; 2) it automatically enforces the strong heredity property, i.e., an interaction term can only be included in the model if the corresponding main effects are in the model; 3) it reduces the dimensionality of the problem and leverages the high correlations by transforming the input feature space using network connectivity measures; and 4) it leads to interpretable models which are biologically meaningful.

    Results: An extensive simulation study shows that our method outperforms LASSO, Elastic Net and Group LASSO in terms of both prediction accuracy and feature selection. We apply our methods to the NIH pediatric brain development study to refine estimates of which regions of the frontal cortex are associated with intelligence scores, and to a sample of mother-child pairs from a prospective birth cohort to identify epigenetic marks observed at birth that help predict childhood obesity. Our method is implemented in an R package: http://sahirbhatnagar.com/eclust/
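The strong heredity property in point 2) can be illustrated with a small toy filter: an interaction term survives only if both of its parent main effects are selected. This is a hypothetical sketch for illustration, not the actual penalization mechanism (which enforces heredity through the structure of the penalty rather than post-hoc filtering):

```python
def prune_to_strong_heredity(main_effects, interactions):
    """Keep an interaction (X_j, X_k) only if BOTH parent main effects
    are in the selected set -- the strong heredity property."""
    kept = set(main_effects)
    return [(j, k) for (j, k) in interactions if j in kept and k in kept]
```

Under weak heredity, by contrast, an interaction would be kept if at least one of its parents is selected.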

    Persévérance et réussite scolaire par le forage de données d'éducation [rapport de recherche]

    No full text
    Includes bibliographical references. This research was carried out under a grant from the Ministère de l'Éducation et de l'Enseignement supérieur. PAREA no. PA-2015-001