38 research outputs found

    Efficient network-guided multi-locus association mapping with graph cuts

    As an increasing number of genome-wide association studies reveal the limitations of attempting to explain phenotypic heritability by single genetic loci, there is growing interest in associating complex phenotypes with sets of genetic loci. While several methods for multi-locus mapping have been proposed, it is often unclear how to relate the detected loci to the growing knowledge about gene pathways and networks. The few methods that take biological pathways or networks into account are either restricted to investigating a limited number of predetermined sets of loci, or do not scale to genome-wide settings. We present SConES, a new efficient method to discover sets of genetic loci that are maximally associated with a phenotype while being connected in an underlying network. Our approach is based on a minimum cut reformulation of the problem of selecting features under sparsity and connectivity constraints, which can be solved exactly and rapidly. SConES outperforms state-of-the-art competitors in terms of runtime, scales to hundreds of thousands of genetic loci, and exhibits higher power in detecting causal SNPs in simulation studies than existing methods. On flowering time phenotypes and genotypes from Arabidopsis thaliana, SConES detects loci that enable accurate phenotype prediction and that are supported by the literature. Matlab code for SConES is available at http://webdav.tuebingen.mpg.de/u/karsten/Forschung/scones/. Comment: 20 pages, 6 figures, accepted at ISMB (International Conference on Intelligent Systems for Molecular Biology) 2013.
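    The min-cut reformulation can be made concrete with a small sketch. The following is an illustrative Python/networkx implementation of the general idea described above (select features whose summed association scores outweigh a per-feature sparsity penalty and a penalty for every network edge cut), not the authors' Matlab code; the function name, the toy SNP scores, and the parameters eta and lam are assumptions for the example.

```python
# Illustrative Python/networkx sketch of a min-cut feature selector in the
# spirit described above (not the authors' Matlab code). Function name, toy
# scores and the parameters eta (sparsity) and lam (connectivity) are assumed.
import networkx as nx

def select_connected_features(scores, edges, eta, lam):
    """Select the subset S of features maximizing
    sum of scores in S - eta * |S| - lam * (network edges cut between S and the rest),
    via an s-t minimum cut."""
    G = nx.DiGraph()
    for i, c in scores.items():
        if c > eta:
            G.add_edge("source", i, capacity=c - eta)  # gain lost if i is dropped
        elif c < eta:
            G.add_edge(i, "sink", capacity=eta - c)    # cost paid if i is kept
    for i, j in edges:
        G.add_edge(i, j, capacity=lam)                 # cutting a network edge costs lam
        G.add_edge(j, i, capacity=lam)
    _, (source_side, _) = nx.minimum_cut(G, "source", "sink")
    return source_side - {"source"}

# Toy usage: snp1 and snp2 are strongly scored and connected, snp3 is weak,
# so the selected set is {'snp1', 'snp2'}.
selected = select_connected_features(
    {"snp1": 2.0, "snp2": 1.5, "snp3": 0.1},
    [("snp1", "snp2"), ("snp2", "snp3")],
    eta=0.5, lam=0.2)
print(sorted(selected))
```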

    Dynamic Risk Prediction of 30-Day Mortality in Patients With Advanced Lung Cancer: Comparing Five Machine Learning Approaches

    PURPOSE Administering systemic anticancer treatment (SACT) to patients near death can negatively affect their health-related quality of life. Late SACT administrations should be avoided in these cases. Machine learning techniques could be used to build decision support tools that leverage registry data and help clinicians limit late SACT administration. MATERIALS AND METHODS Patients with advanced lung cancer who were treated at the Department of Oncology, Aalborg University Hospital, and died between 2010 and 2019 were included (N = 2,368). Diagnoses, treatments, biochemical data, and histopathologic results were used to train predictive models of 30-day mortality using logistic regression with elastic net penalty, random forest, gradient tree boosting, a multilayer perceptron, and a long short-term memory network. The importance of the variables and the clinical utility of the models were evaluated. RESULTS The random forest and gradient tree boosting models outperformed other models, whereas the artificial neural network–based models underperformed. Adding summary variables had a modest effect on performance, with an increase in average precision from 0.500 to 0.505 and from 0.498 to 0.509 for the gradient tree boosting and random forest models, respectively. Biochemical results alone contained most of the information, with limited degradation of performance when fitting models with only these variables. The utility analysis showed that by applying a simple threshold to the predicted risk of 30-day mortality, 40% of late SACT administrations could have been prevented at the cost of 2% of patients stopping their treatment 90 days before death. CONCLUSION This study demonstrates the potential of a decision support tool to limit late SACT administration in patients with cancer. Further work is warranted to refine the model, build an easy-to-use prototype, and conduct a prospective validation study.
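    As a rough illustration of how such a decision-support rule could be built, here is a hedged scikit-learn sketch: a gradient tree boosting classifier is fit on synthetic stand-in features, evaluated with average precision, and a simple threshold on the predicted 30-day mortality risk flags administrations to reconsider. The features, labels, and the 0.5 threshold are assumptions for the example, not the study's data or tuned cut-off.

```python
# Hedged scikit-learn sketch (not the study's pipeline or data): fit a gradient
# tree boosting model of 30-day mortality and flag decisions above a risk
# threshold. Features, labels and the 0.5 threshold are made up for the example.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                          # stand-in biochemical/clinical features
y = (X[:, 0] + rng.normal(size=500) > 1).astype(int)    # 1 = death within 30 days (toy label)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)

risk = model.predict_proba(X_te)[:, 1]
print("average precision:", average_precision_score(y_te, risk))

threshold = 0.5                           # trades prevented late SACT administrations
flag_reconsider_sact = risk >= threshold  # against treatments stopped too early
```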

    Outils d'apprentissage statistique pour la découverte de biomarqueurs

    My research focuses on the development of machine learning tools for therapeutic research. In particular, my goal is to propose computational tools that can exploit data sets to extract biological hypotheses explaining, at the genomic or molecular level, the differences between samples that are observed at a macroscopic scale. Such tools are necessary for the development of precision medicine, which requires identifying the characteristics, genomic or otherwise, that explain differences in prognosis or therapeutic response between patients who exhibit the same symptoms. These questions can often be formulated as feature selection problems. However, typical data sets contain many more features than samples, which poses statistical challenges. To address these challenges, my work is organized along three axes. First, the knowledge accumulated about biological systems can often be represented as biological networks. Under the hypothesis that features connected in these networks are likely to act together on a phenotype, we propose to use biological networks to guide feature selection algorithms. The idea is to define constraints that encourage the selected features to be connected in a given network. The formulation we proposed, which can be seen as a special case of what I call regularized relevance, allows us to efficiently select features in data sets containing hundreds of thousands of variables. Second, to compensate for the small number of available samples, so-called multitask methods solve several related problems, or tasks, simultaneously. We have generalized regularized relevance to this context. I have also worked on the case where a similarity between tasks can be defined, so as to impose that the more similar two tasks are, the more similar the sets of features selected for them should be. Such approaches can be used to study the response to different drug treatments: one can then use the similarity between the molecular structures of the drugs, a topic I studied during my PhD. Finally, most feature selection methods used in genomics can only explain the phenomenon of interest through linear effects. However, a large body of literature indicates that regions of the genome can interact nonlinearly. Modeling such interactions, which are called epistatic, exacerbates the aforementioned statistical challenges and creates computational issues: evaluating all possible combinations of variables becomes intractable. My work in this area addresses both these computational issues and the statistical challenges encountered when modeling quadratic interactions between pairs of genomic regions. More recently, we have also developed approaches that model more complex interactions using kernel methods.
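    For concreteness, one way to write the single-task special case of such a regularized-relevance objective (the network-guided selection also used by SConES above) is sketched below; the notation (per-feature association scores c_j, network adjacency W, sparsity parameter eta, connectivity parameter lambda) is chosen for this sketch rather than taken from the text, and the multitask extension would add a term coupling the feature sets selected for related tasks.

```latex
% Sketch of a network-regularized feature selection objective (notation assumed):
% pick the subset S of features that maximizes total association, minus a
% per-feature sparsity penalty and a penalty for every network edge cut.
\[
  \hat{S} \;=\; \operatorname*{arg\,max}_{S \subseteq \{1,\dots,p\}}
      \;\sum_{j \in S} c_j
      \;-\; \eta\,\lvert S \rvert
      \;-\; \lambda \sum_{j \in S,\; k \notin S} W_{jk}
\]
```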

    Machine learning and genomics: precision medicine versus patient privacy


    Multitask group Lasso for Genome-Wide Association Studies in diverse populations

    Genome-wide association studies, or GWAS, aim at finding single nucleotide polymorphisms (SNPs) that are associated with a phenotype of interest. GWAS are known to suffer from the large dimensionality of the data with respect to the number of available samples. Other limiting factors include the dependency between SNPs, due to linkage disequilibrium (LD), and the need to account for population structure, that is to say, confounding due to genetic ancestry. We propose an efficient approach for the multivariate analysis of multi-population GWAS data based on a multitask group Lasso formulation. Each task corresponds to a subpopulation of the data, and each group to an LD block. This formulation alleviates the curse of dimensionality, and makes it possible to identify disease LD blocks shared across populations/tasks, as well as some that are specific to one population/task. In addition, we use stability selection to increase the robustness of our approach. Finally, gap safe screening rules speed up computations enough that our method can run at a genome-wide scale. To our knowledge, this is the first framework for GWAS on diverse populations combining feature selection at the level of LD groups, a multitask approach to address population structure, stability selection, and safe screening rules. We show that our approach outperforms state-of-the-art methods on both simulated and real-world cancer datasets.
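    To make the formulation more concrete, below is a minimal proximal-gradient sketch of a multitask group Lasso in Python (not the authors' implementation or tuning): each task is a population with its own design matrix and response, and each group of SNPs (standing in for an LD block) is penalized jointly across tasks, so whole blocks are kept or discarded together. The group boundaries, penalty lam, and toy data are assumptions for the example; stability selection and safe screening rules are not shown.

```python
# Minimal proximal-gradient sketch of a multitask group Lasso (illustrative,
# not the authors' implementation): tasks are populations with their own design
# matrices, and each group of SNPs (standing in for an LD block) is penalized
# jointly across tasks so whole blocks are selected or dropped together.
# Group boundaries, lam and the toy data are assumptions for the example.
import numpy as np

def multitask_group_lasso(Xs, ys, groups, lam, n_iter=500):
    """Xs, ys: per-task design matrices / responses sharing the same p columns.
    groups: list of index arrays partitioning 0..p-1 (e.g. LD blocks).
    Returns the coefficient matrix W of shape (p, n_tasks)."""
    p, T = Xs[0].shape[1], len(Xs)
    W = np.zeros((p, T))
    # step size from a crude Lipschitz bound on the per-task quadratic losses
    step = 1.0 / max(np.linalg.norm(X, 2) ** 2 / len(y) for X, y in zip(Xs, ys))
    for _ in range(n_iter):
        # gradient of the smooth part, task by task
        G = np.column_stack([X.T @ (X @ W[:, t] - y) / len(y)
                             for t, (X, y) in enumerate(zip(Xs, ys))])
        W = W - step * G
        # proximal step: soft-threshold each group across all tasks at once
        for g in groups:
            norm_g = np.linalg.norm(W[g, :])
            W[g, :] = 0.0 if norm_g <= step * lam else W[g, :] * (1 - step * lam / norm_g)
    return W

# Toy usage: two populations, 6 SNPs in 3 blocks; only block 0 carries signal.
rng = np.random.default_rng(0)
Xs = [rng.normal(size=(100, 6)), rng.normal(size=(80, 6))]
ys = [X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=len(X)) for X in Xs]
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(np.round(multitask_group_lasso(Xs, ys, groups, lam=0.5), 2))
```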

    martini: an R package for genome-wide association studies using SNP networks

    Systems biology shows that genes related to the same phenotype are often functionally related. We can take advantage of this to discover new genes that affect a phenotype. However, the natural unit of analysis in genome-wide association studies (GWAS) is not the gene, but the single nucleotide polymorphism, or SNP. We introduce martini, an R package to build SNP co-function networks and use them to conduct GWAS. In SNP networks, two SNPs are connected if there is evidence that they jointly contribute to the same biological function. By leveraging such information in GWAS, we search for SNPs that are not only strongly associated with a phenotype, but also functionally related. This, in turn, boosts discovery and interpretability. Martini builds such networks using three sources of information: genomic position, gene annotations, and gene-gene interactions. The resulting SNP networks involve hundreds of thousands of nodes and millions of edges, making their exploration computationally intensive. Martini implements two network-guided biomarker discovery algorithms based on graph cuts that can handle such large networks: SConES and SigMod. Both seek a small subset of SNPs that are strongly associated with the phenotype of interest and densely interconnected in the network. Both algorithms use parameters that control the relative importance of the SNPs' association scores, the number of SNPs selected, and their interconnection. Martini includes a cross-validation procedure to set these parameters automatically. Lastly, martini includes tools to visualize the selected SNPs' network and association properties. Martini is available on GitHub (https://github.com/hclimente/martini) and Bioconductor (https://www.bioconductor.org/packages/release/bioc/html/martini.html).
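    Since martini itself is an R package, the following is only a language-agnostic toy of the network-building step it describes, written in Python/networkx rather than with martini's API: two SNPs are connected when they are close on the genome, mapped to the same gene, or mapped to interacting genes. All SNP positions, gene annotations, and interactions below are made up for illustration.

```python
# Toy Python/networkx sketch of the network-building step (martini itself is an
# R package; this is not its API). Two SNPs are connected if they are close on
# the genome, annotated to the same gene, or annotated to interacting genes.
# All positions, annotations and interactions below are made up.
import networkx as nx
from itertools import combinations

snp_positions = {"rs1": ("chr1", 100), "rs2": ("chr1", 200), "rs3": ("chr1", 5000)}
snp_to_gene = {"rs1": "GENE_A", "rs2": "GENE_A", "rs3": "GENE_B"}
gene_interactions = {("GENE_A", "GENE_B")}

net = nx.Graph()
net.add_nodes_from(snp_positions)

# 1) genomic position: connect SNPs that are close on the same chromosome
for (a, (chr_a, pos_a)), (b, (chr_b, pos_b)) in combinations(snp_positions.items(), 2):
    if chr_a == chr_b and abs(pos_a - pos_b) < 1000:
        net.add_edge(a, b, source="position")

# 2) gene annotations: connect SNPs mapped to the same gene
# 3) gene-gene interactions: connect SNPs whose genes interact
# (in this toy, a later source simply overwrites the 'source' label of an edge)
for a, b in combinations(snp_to_gene, 2):
    gene_a, gene_b = snp_to_gene[a], snp_to_gene[b]
    if gene_a == gene_b:
        net.add_edge(a, b, source="gene")
    elif tuple(sorted((gene_a, gene_b))) in gene_interactions:
        net.add_edge(a, b, source="interaction")

print(net.edges(data=True))
```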

    Efficient multi-task chemogenomics for drug specificity prediction

    Adverse drug reactions, also called side effects, range from mild to fatal clinical events and significantly affect the quality of care. Among other causes, side effects occur when drugs bind to proteins other than their intended target. As experimentally testing drug specificity against the entire proteome is out of reach, we investigate the application of chemogenomics approaches. We formulate the study of drug specificity as a problem of predicting interactions between drugs and proteins at the proteome scale. We build several benchmark datasets, and propose NN-MT, a multi-task Support Vector Machine (SVM) algorithm that is trained on a limited number of data points, in order to solve the computational issues of proteome-wide SVMs for chemogenomics. We compare NN-MT to different state-of-the-art methods, and show that its prediction performance is similar or better, at a reasonable computational cost. Compared to its competitors, the proposed method is particularly efficient at predicting (protein, ligand) interactions in the difficult double-orphan case, i.e., when no interactions are previously known for either the protein or the ligand. The NN-MT algorithm appears to be a good default method, providing state-of-the-art or better performance in the wide range of prediction scenarios considered in the present study: proteome-wide prediction, protein family prediction, test (protein, ligand) pairs dissimilar to pairs in the training set, and orphan cases.
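    A hedged sketch of the general chemogenomics setup described here (not the NN-MT algorithm itself): a pairwise kernel between (protein, ligand) pairs is built as the product of a protein kernel and a ligand kernel, and a support vector machine is trained on that precomputed kernel. The random descriptors, the RBF kernels, and the gamma value are assumptions for the example; NN-MT's restriction to a limited number of nearest-neighbour training pairs is not shown.

```python
# Hedged sketch of a pairwise-kernel SVM for chemogenomics (not the NN-MT code):
# the kernel between two (protein, ligand) pairs is the product of a protein
# kernel and a ligand kernel, and an SVM is trained on the precomputed kernel.
# Descriptors, labels and gamma are made-up assumptions for the example.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_pairs = 60
prot_feat = rng.normal(size=(n_pairs, 10))   # stand-in protein descriptors
lig_feat = rng.normal(size=(n_pairs, 16))    # stand-in ligand fingerprints
y = (prot_feat[:, 0] * lig_feat[:, 0] > 0).astype(int)  # 1 = interaction (toy rule)

def rbf(X, Y, gamma=0.1):
    """Gaussian kernel between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# pairwise kernel: K((p, l), (p', l')) = K_protein(p, p') * K_ligand(l, l')
K = rbf(prot_feat, prot_feat) * rbf(lig_feat, lig_feat)

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K[:5]))                    # predictions for the first 5 pairs
```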

    Nonlinear post-selection inference for genome-wide association studies

    Association testing in genome-wide association studies (GWAS) is often performed at either the SNP level or the gene level. The two levels can bring different insights into disease mechanisms. In the present work, we provide a novel approach based on nonlinear post-selection inference to bridge the gap between them. Our approach selects, within a gene, the SNPs or LD blocks most associated with the phenotype, before testing their combined effect. Both the selection and the association testing are conducted nonlinearly. We apply our tool to the study of BMI and its variation in the UK Biobank. In this study, our approach outperformed other gene-level association testing tools, with the unique benefit of pinpointing the causal SNPs.
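    To illustrate the select-then-test structure (though not the authors' selective-inference machinery, which corrects for selection on the same data), here is a toy Python sketch using data splitting as a simple, conservative stand-in: the SNPs within a gene that are most associated on one half of the samples are selected, and their joint effect is tested on the other half. The genotype simulation, the number of selected SNPs k, and the linear F-test are assumptions for the example.

```python
# Toy sketch of the select-then-test structure, using data splitting as a
# simple, conservative stand-in for post-selection inference (not the authors'
# method, which corrects for selection on the same data). Genotypes, k and the
# linear F-test are assumptions for the example.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 400, 30                                      # samples, SNPs within one gene
X = rng.integers(0, 3, size=(n, p)).astype(float)   # toy genotypes coded 0/1/2
y = 0.4 * X[:, 3] + rng.normal(size=n)              # a single truly associated SNP

half = n // 2
sel, inf = slice(None, half), slice(half, None)

# selection step: keep the k SNPs most associated with y on the first half
k = 2
scores = [abs(np.corrcoef(X[sel, j], y[sel])[0, 1]) for j in range(p)]
chosen = np.argsort(scores)[-k:]

# inference step: test the chosen SNPs' joint (here linear) effect on the
# second half with an F-test against the intercept-only model
X_inf = np.column_stack([np.ones(n - half), X[inf][:, chosen]])
beta, *_ = np.linalg.lstsq(X_inf, y[inf], rcond=None)
rss_full = ((y[inf] - X_inf @ beta) ** 2).sum()
rss_null = ((y[inf] - y[inf].mean()) ** 2).sum()
df1, df2 = k, (n - half) - k - 1
F = ((rss_null - rss_full) / df1) / (rss_full / df2)
print("joint p-value:", 1 - stats.f.cdf(F, df1, df2))
```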