260 research outputs found

    A Sparse Graph-Structured Lasso Mixed Model for Genetic Association with Confounding Correction

    While the linear mixed model (LMM) has shown competitive performance in correcting spurious associations arising from population stratification, family structure, and cryptic relatedness, challenges remain regarding the complex structure of genotypic and phenotypic data. For example, geneticists have discovered that some clusters of phenotypes are more co-expressed than others. Hence, a joint analysis that can exploit such relatedness information in a heterogeneous dataset is crucial for genetic modeling. We propose the sparse graph-structured linear mixed model (sGLMM), which incorporates relatedness information among traits while correcting for confounding. Our method can uncover the genetic associations of a large number of phenotypes jointly while accounting for the relatedness of these phenotypes. Through extensive simulation experiments, we show that the proposed model outperforms existing approaches and can model correlation arising from both population structure and shared signals. Further, we validate the effectiveness of sGLMM on real-world genomic datasets from two species, a plant and humans. On Arabidopsis thaliana data, sGLMM performs better than all other baseline models for 63.4% of traits. We also discuss the potential causal genetic variation of human Alzheimer's disease discovered by our model and justify some of the most important genetic loci. Comment: Code available at https://github.com/YeWenting/sGLM
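
The confounding correction that LMM-based methods rely on can be illustrated with a minimal sketch. This is not the sGLMM implementation: it only shows the generic step of whitening the data with respect to a kinship-based covariance, after which a sparse regression could be fitted on the rotated data. The function names (`kinship`, `whitening_matrix`), the variance ratio `delta`, and the simulated genotypes are all assumptions made for illustration.

```python
import numpy as np

def kinship(X):
    """Realized relationship matrix from column-standardized genotypes (assumed definition)."""
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    return Z @ Z.T / X.shape[1]

def whitening_matrix(K, delta):
    """Return W such that W @ (K + delta * I) @ W.T is (numerically) the identity."""
    s, U = np.linalg.eigh(K)          # K = U diag(s) U^T
    s = np.clip(s, 0.0, None)         # guard against tiny negative eigenvalues
    return (U / np.sqrt(s + delta)).T # diag((s + delta)^(-1/2)) @ U^T

# Simulated data, purely illustrative: 50 individuals, 200 variants.
rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(50, 200)).astype(float)
y = rng.normal(size=50)

K = kinship(X)
W = whitening_matrix(K, delta=1.0)    # delta is a hypothetical noise-to-genetic variance ratio

# After rotation, the residual covariance implied by the kinship term is the identity,
# so a standard sparse regression on (X_rot, y_rot) no longer sees that correlation.
X_rot, y_rot = W @ X, W @ y
```

In practice `delta` would be estimated from the data rather than fixed, and the penalty applied to the rotated problem is where a graph-structured method would differ from a plain lasso.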

    Sparse Probit Linear Mixed Model

    Linear Mixed Models (LMMs) are important tools in statistical genetics. When used for feature selection, they allow one to find a sparse set of genetic traits that best predict a continuous phenotype of interest, while simultaneously correcting for confounding factors such as age, ethnicity, and population structure. Formulated as models for linear regression, LMMs have been restricted to continuous phenotypes. We introduce the Sparse Probit Linear Mixed Model (Probit-LMM), which generalizes the LMM modeling paradigm to binary phenotypes. As a technical challenge, the model no longer possesses a closed-form likelihood function. In this paper, we present a scalable approximate inference algorithm that lets us fit the model to high-dimensional datasets. We show on three real-world examples from different domains that, in the setting of binary labels, our algorithm leads to better prediction accuracy and also selects features that show less correlation with the confounding factors. Comment: Published version, 21 pages, 6 figures
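
To see why the likelihood loses its closed form, one common way to write a probit model with correlated noise is sketched below. The notation is assumed here for illustration and is not necessarily the paper's: $x_i$ is the feature vector of sample $i$, $w$ the sparse weight vector, and $\Sigma$ a noise covariance that absorbs the confounders.

```latex
% Continuous LMM with correlated noise
y = X w + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Sigma)

% Probit variant for binary labels y_i \in \{-1, +1\}
y_i = \operatorname{sign}\!\left(x_i^\top w + \epsilon_i\right)

% The likelihood is a Gaussian orthant probability, with no closed form for general \Sigma
P(y \mid X, w) =
  \int_{\{\epsilon \,:\, y_i (x_i^\top w + \epsilon_i) > 0 \ \forall i\}}
  \mathcal{N}(\epsilon;\, 0, \Sigma)\, d\epsilon

% Special case \Sigma = \sigma^2 I: the integral factorizes into standard probit terms
P(y \mid X, w) = \prod_{i} \Phi\!\left(\frac{y_i\, x_i^\top w}{\sigma}\right)
```

Only the independent-noise case reduces to a product of one-dimensional normal CDFs; with a full covariance the integral couples all samples, which is what motivates approximate inference.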

    LIMIX: genetic analysis of multiple traits

    Multi-trait mixed models have emerged as a promising approach for joint analyses of multiple traits. In principle, the mixed model framework is remarkably general. However, current methods implement only a very specific range of tasks to optimize the necessary computations. Here, we present a multi-trait modeling framework that is versatile and fast: LIMIX enables users to flexibly adapt mixed models for a broad range of applications with different observed and hidden covariates and variable study designs. To highlight the novel modeling aspects of LIMIX, we performed three vastly different genetic studies: joint GWAS of correlated blood lipid phenotypes, joint analysis of the expression levels of the multiple transcript isoforms of a gene, and pathway-based modeling of molecular traits across environments. In these applications we show that LIMIX increases GWAS power and phenotype prediction accuracy, in particular when integrating stepwise multi-locus regression into multi-trait models and when analyzing large numbers of traits. An open source implementation of LIMIX is freely available at: https://github.com/PMBio/limix

    Machine learning approaches for high-dimensional genome-wide association studies

    Genome-wide association studies (GWAS) aim to find statistical associations between genetic variants and traits of interest. Genetic variants that explain a large share of the variation in genome-wide gene expression may lead to confounding in expression quantitative trait loci (eQTL) analyses. To account for these confounding factors, in Article I we proposed LVREML, a method conceptually analogous to estimating fixed and random effects in linear mixed models (LMMs). We showed that the maximum-likelihood latent variables can always be chosen orthogonal to the known factors (such as genetic variants). This indicates that the maximum-likelihood variables explain the sample covariance that is not already explained by the genetic variants in the model. To identify which traits are affected by the identified genetic variants, we need to reverse the functional relation between genotypes and traits. In this regard, multi-trait approaches are more advantageous than studying the traits individually. Multi-trait approaches benefit from increased power, gained by considering cross-trait covariances, and from a reduced multiple-testing burden, because a single test suffices to test for association with a set of traits. In Article II, we analyzed various machine learning methods (ridge regression, Naive Bayes/independent univariate correlation, random forests, and support vector machines) for reverse regression in multi-trait GWAS, using genotypes, gene expression data, and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains to evaluate the methods. In Article III, we extended this approach to human data. An important difference between the data in Article II and Article III is that no ground-truth data are available for the latter. We used genotypes and brain-imaging features extracted from MRIs obtained from the ADNI database. The results from both Article II and Article III showed that genotype prediction performance varied across genetic variants. This helped in identifying genomic regions that are associated with a large number of traits in high-dimensional phenotypic data. We also observed that the feature coefficients of fitted machine learning models correlated with the strength of association between variants and traits. Our results also showed that non-linear machine learning methods such as random forests identified genetic variants distinct from those identified by the linear methods. In particular, we observed in Article III that random forests were able to identify single-nucleotide polymorphisms (SNPs) that were distinct from the ones identified by ridge and lasso regression. Further analysis showed that the identified SNPs belonged to genes previously associated with brain-related disorders. Doctoral thesis
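
The idea of reverse regression, predicting a genotype from many traits and ranking variants by how well they can be predicted, can be sketched as follows. This is a toy illustration on simulated data, not the thesis's pipeline: the function names, the ridge penalty `lam`, the train/test split, and the simulated effect sizes are all assumptions.

```python
import numpy as np

def ridge_fit(A, b, lam):
    """Closed-form ridge weights for centered predictors A and centered target b."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

def reverse_regression_scores(G, Y, lam=1.0, n_train=150):
    """Held-out R^2 for predicting each variant (column of G) from all traits Y."""
    Ytr, Yte = Y[:n_train], Y[n_train:]
    mu = Ytr.mean(axis=0)
    scores = []
    for j in range(G.shape[1]):
        gtr, gte = G[:n_train, j], G[n_train:, j]
        w = ridge_fit(Ytr - mu, gtr - gtr.mean(), lam)
        pred = (Yte - mu) @ w + gtr.mean()
        ss_res = np.sum((gte - pred) ** 2)
        ss_tot = np.sum((gte - gte.mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Simulated data, purely illustrative: variant 0 shifts every trait, variants 1-4 are null.
rng = np.random.default_rng(0)
n, n_traits, n_snps = 200, 20, 5
G = rng.binomial(2, 0.5, size=(n, n_snps)).astype(float)
Y = rng.normal(size=(n, n_traits))
Y += G[:, [0]] * 1.0

scores = reverse_regression_scores(G, Y)
best_variant = int(np.argmax(scores))  # the variant most predictable from the traits
```

A single prediction score per variant replaces one test per variant-trait pair, which is the reduction in multiple testing that the abstract refers to.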

    Univariate and multivariate statistical approaches for the analyses of omics data: sample classification and two-block integration.

    The wealth of information generated by high-throughput omics technologies in the context of large-scale epidemiological studies has made a significant contribution to the identification of factors influencing the onset and progression of common diseases. Advanced computational and statistical modelling techniques are required to manipulate and extract meaningful biological information from these omics data, as several layers of complexity are associated with them. Recent research efforts have concentrated on the development of novel statistical and bioinformatic tools; however, studies thoroughly investigating the applicability and suitability of these novel methods on real data have often fallen behind. This thesis focuses on the analysis of proteomics and transcriptomics data from the EnviroGenoMarker project with the purpose of addressing two main research objectives: i) to critically appraise established and recently developed statistical approaches in their ability to appropriately accommodate the inherently complex nature of real-world omics data, and ii) to improve the current understanding of a prevalent condition by identifying biological markers predictive of disease as well as possible biological mechanisms leading to its onset. The disease endpoint of interest is B-cell lymphoma, a common haematological malignancy for which many questions about its aetiology remain unanswered. The seven chapters of this thesis are structured as follows: the first two are introductory chapters describing the main omics technologies and the statistical methods employed for their analysis. The third chapter describes the epidemiological project giving rise to the study population and the disease outcome of interest. These are followed by three results chapters that address the research aims described above by applying univariate and multivariate statistical approaches for sample classification and data integration. A summary of findings, general concluding remarks, and a discussion of open problems offering potential avenues for future research are presented in the final chapter.

    Disentangling causal webs in the brain using functional Magnetic Resonance Imaging: A review of current approaches

    In the past two decades, functional Magnetic Resonance Imaging (fMRI) has been used to relate neuronal network activity to cognitive processing and behaviour. Recently this approach has been augmented by algorithms that allow us to infer causal links between component populations of neuronal networks. Multiple inference procedures have been proposed to approach this research question, but so far each method has limitations when it comes to establishing whole-brain connectivity patterns. In this work, we discuss eight ways to infer causality in fMRI research: Bayesian Nets, Dynamical Causal Modelling, Granger Causality, Likelihood Ratios, LiNGAM, Patel's Tau, Structural Equation Modelling, and Transfer Entropy. We conclude by formulating recommendations for future directions in this area.

    f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq.

    Single-cell RNA-sequencing (scRNA-seq) allows studying heterogeneity in gene expression in large cell populations. Such heterogeneity can arise from technical or biological factors, which makes decomposing the sources of variation difficult. Here we describe f-scLVM (factorial single-cell latent variable model), a method based on factor analysis that uses pathway annotations to guide the inference of interpretable factors underpinning the heterogeneity. Our model jointly estimates the relevance of individual factors, refines gene set annotations, and infers factors without annotation. In applications to multiple scRNA-seq datasets, we find that f-scLVM robustly decomposes scRNA-seq datasets into interpretable components, thereby facilitating the identification of novel subpopulations.

    Methods for Stratification and Validation Cohorts: A Scoping Review

    Personalized medicine requires large cohorts for patient stratification and validation of patient clustering. However, standards and harmonized practices on the methods and tools to be used for the design and management of cohorts in personalized medicine remain to be defined. This study aims to describe the current state of the art in this area. A scoping review was conducted, searching PubMed, EMBASE, Web of Science, PsycINFO, and Cochrane Library for reviews about tools and methods related to cohorts used in personalized medicine. The search focused on cancer, stroke, and Alzheimer's disease and was limited to reports in English, French, German, Italian, and Spanish published from 2005 to April 2020. The screening process was reported through a PRISMA flowchart. Fifty reviews were included, most of which covered how data were generated (25/50) and the tools used for data management and analysis (24/50). No direct information was found about the quality of data and the requirements to monitor associated clinical data. A scarcity of information and standards was found in specific areas such as sample size calculation. With this information, comprehensive guidelines could be developed in the future to improve reproducibility and robustness in the design and management of cohorts in personalized medicine studies.