1,302 research outputs found

    Machine Learning Patterns for Neuroimaging-Genetic Studies in the Cloud

    Brain imaging is a natural intermediate phenotype for understanding the link between genetic information and behavior or risk factors for brain pathologies. Massive efforts have been made in the last few years to acquire high-dimensional neuroimaging and genetic data on large cohorts of subjects. The statistical analysis of such data is carried out with increasingly sophisticated techniques and represents a great computational challenge. Fortunately, increasing computational power in distributed architectures can be harnessed, if new neuroinformatics infrastructures are designed and training to use these new tools is provided. Combining a MapReduce framework (TomusBLOB) with machine learning algorithms (Scikit-learn library), we design a scalable analysis tool that can deal with non-parametric statistics on high-dimensional data. End-users describe the statistical procedure to perform and can then test the model on their own computers before running the very same code in the cloud at a larger scale. We illustrate the potential of our approach on real data with an experiment showing how the functional signal in subcortical brain regions can be significantly fit with genome-wide genotypes. This experiment demonstrates the scalability and the reliability of our framework in the cloud with a two-week deployment on hundreds of virtual machines.

    Second-generation PLINK: rising to the challenge of larger and richer datasets

    PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information. The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use. Comment: 2 figures, 1 additional file
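
    The "bit-level parallelism" mentioned in this abstract can be illustrated with a small sketch. This is not PLINK's code or its file format: the 2-bit encoding (the stored value is simply the alternate-allele count 0, 1 or 2) and the function names `pack_genotypes`, `popcount` and `count_alt_alleles` are assumptions made for illustration. The idea shown is that many genotypes packed into one machine word can be tallied with a few mask-and-popcount operations instead of a per-sample loop.

```python
def pack_genotypes(genotypes):
    """Pack allele counts (0, 1 or 2) into one integer, 2 bits per sample.

    Hypothetical encoding for illustration only: the 2-bit field of
    sample i holds its alternate-allele count directly.
    """
    word = 0
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1 or 2")
        word |= g << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def count_alt_alleles(word, n_samples):
    """Total alternate-allele count across all packed samples.

    The low bit of each field contributes 1 allele and the high bit
    contributes 2, so two masked popcounts cover every sample at once.
    """
    low_mask = int("01" * n_samples, 2) if n_samples else 0
    high_mask = low_mask << 1
    return popcount(word & low_mask) + 2 * popcount(word & high_mask)


genotypes = [0, 1, 2, 2, 1, 0, 1]
packed = pack_genotypes(genotypes)
print(count_alt_alleles(packed, len(genotypes)))  # prints 7
```

    In a compiled implementation the same pattern would typically operate on fixed-width words with a hardware popcount instruction; the arbitrary-precision Python integer here only stands in for that.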

    A Weighted U Statistic for Genetic Association Analyses of Sequencing Data

    With advancements in next-generation sequencing technology, massive amounts of sequencing data are being generated, offering a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, this poses a great challenge for the statistical analysis of high-dimensional sequencing data. Association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a weighted U statistic, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a non-parametric U statistic, WU-SEQ makes no assumption about the underlying disease model or phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed the commonly used SKAT method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL4 and very low density lipoprotein cholesterol.
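
    The general shape of a weighted U statistic can be sketched as follows. The abstract does not give the kernel or the weights used by WU-SEQ, so everything specific here is an assumption: the phenotype kernel `neg_squared_difference`, the genetic weight `shared_allele_weight`, the normalization by the number of pairs, and the function name `weighted_u` are all hypothetical choices for illustration.

```python
from itertools import combinations


def weighted_u(phenotypes, genotypes, kernel, weight):
    """Average of weight(g_i, g_j) * kernel(y_i, y_j) over unordered pairs.

    Generic weighted U statistic of order 2; the kernel and weight
    functions are supplied by the caller.
    """
    n = len(phenotypes)
    if n != len(genotypes):
        raise ValueError("phenotypes and genotypes must have equal length")
    if n < 2:
        raise ValueError("at least two subjects are required")
    total = 0.0
    n_pairs = 0
    for i, j in combinations(range(n), 2):
        total += weight(genotypes[i], genotypes[j]) * kernel(phenotypes[i], phenotypes[j])
        n_pairs += 1
    return total / n_pairs


def neg_squared_difference(y_i, y_j):
    """Hypothetical phenotype similarity: larger when phenotypes are closer."""
    return -((y_i - y_j) ** 2)


def shared_allele_weight(g_i, g_j):
    """Hypothetical genetic similarity: dot product of allele-count vectors."""
    return sum(a * b for a, b in zip(g_i, g_j))


phenotypes = [1.0, 1.5, 3.0]
genotypes = [[1, 0], [1, 0], [0, 1]]  # rows: subjects, columns: variants
print(weighted_u(phenotypes, genotypes, neg_squared_difference, shared_allele_weight))
```

    Because the statistic depends only on pairwise similarities, it needs no distributional assumption on the phenotype; significance would still have to be assessed separately, for example by permutation, which this sketch does not cover.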

    High-throughput computational methods and software for quantitative trait locus (QTL) mapping

    In recent years many new technologies such as tiling arrays and high-throughput sequencing have come to play an important role in systems genetics research. For researchers it is of the utmost importance to understand how this affects their research. This work describes possible solutions to this 'Big Data' avalanche which has hit systems genetics. This thesis describes the work carried out during the author's 4-year PhD project at the Groningen Bioinformatics Centre to develop smarter and more optimized algorithms such as Pheno2Geno and MQM, and to use a collaborative approach such as xQTL workbench to store and analyse high-throughput systems genetics data.

    Advanced Methods for Discovering Genetic Markers Associated with High Dimensional Imaging Data

    Imaging genetic studies have been widely applied to discover genetic factors of inherited neuropsychiatric diseases. Despite the notable contribution of genome-wide association studies (GWAS) in neuroimaging research, it has always been difficult to efficiently perform association analysis on imaging phenotypes. There are several challenges arising from this topic, such as the large dimensionality of imaging data and genetic data, the potential spatial dependency of imaging phenotypes and the computational burden of the GWAS problem. All the aforementioned issues motivate us to investigate new statistical methods in neuroimaging genetic analysis. In the first project, we develop a hierarchical functional principal regression model (HFPRM) to simultaneously study diffusion tensor bundle statistics on multiple fiber tracts. Theoretically, the asymptotic distribution of the global test statistic on the common factors has been studied. Simulations are conducted to evaluate the finite sample performance of HFPRM. Finally, we apply our method to a GWAS of a neonate population to explore important genetic architecture in early human brain development. In the second project, we consider an association test between functional data acquired on a single curve and scalar variables in a varying coefficient model. We propose a functional projection regression model and an associated global test statistic to aggregate weak signals across the domain of functional data. Theoretically, we examine the asymptotic distribution of the global test statistic and provide a strategy to adaptively select the tuning parameter. Simulation experiments show that the proposed test outperforms existing state-of-the-art methods in functional statistical inference. We also apply the proposed method to a GWAS in the UK Biobank dataset. 
In the third project, we introduce an adaptive projection regression model (APRM) to perform statistical inference on high dimensional imaging responses in the presence of high correlations. Dimension reduction of the phenotypes is achieved through a linear projection regression model. We also implement an adaptive inference procedure to detect signals at multiple levels. Numerical simulations demonstrate that APRM outperforms many state-of-the-art methods in high dimensional inference. Finally, we apply APRM to a GWAS of volumetric data on 93 regions of interest in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. Doctor of Philosophy

    Statistical Analysis of Large-Scale Genomic Data Using Pathway Information

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2018. 2. Taesung Park.
In the past two decades, rapid advances in DNA sequencing technology have enabled extensive investigations into human genetic architecture, especially for the identification of genetic variants associated with complex traits. In particular, genome-wide association studies (GWAS) have played a key role in identifying genetic associations between Single Nucleotide Variants (SNVs) and many complex biological pathologies. However, the genetic variants identified by many successful GWAS have explained only a modest part of heritability for most phenotypes, and many hypotheses have been proposed to address the so-called missing heritability issue, such as rare variant association, gene-gene interaction or multi-omics integration. Methods for rare variant analysis arose from extending individual variant-level approaches to those at the gene level, and extending those at the gene level to multiple phenotypes. In this trend, as the number of publicly available biological resources is increasing, recent methods for analyzing rare variants utilize pathway knowledge as a priori information. In this respect, many statistical methods for pathway-based analyses using rare variants have been proposed to analyze pathways individually. However, neglecting correlations between multiple pathways can result in misleading solutions, and pathway-based analyses of large-scale genetic datasets impose a massive computational burden. Moreover, while a number of methods for pathway-based rare-variant analysis of multiple phenotypes have been proposed, no method considers a unified model that incorporates multiple pathways. In this thesis, we propose novel statistical methods to analyze large-scale genetic datasets using pathway information: Pathway-based approach using HierArchical components of collapsed RAre variants Of High-throughput sequencing data (PHARAOH) and PHARAOH-multi. PHARAOH extends generalized structured component analysis, and implements the method based on the framework of generalized linear models, to accommodate phenotype data arising from a variety of exponential family distributions.
PHARAOH constructs a single hierarchical model that consists of collapsed gene-level summaries and pathways, and analyzes entire pathways simultaneously by imposing ridge-type penalties on both gene and pathway coefficient estimates; hence our method considers the correlation of pathways and handles an entire dataset in a single model. In addition, PHARAOH-multi further extends the original model into multivariate analysis, while keeping the advantages of our previous approach. We extend PHARAOH to enable analysis of multiple traits using hierarchical components of genetic variants. In addition, PHARAOH-multi can identify associations between multiple phenotypes and multiple pathways with a single model, with genes nested within pathways as a hierarchy. Through simulation studies, PHARAOH was shown to have higher statistical power than the existing pathway-based methods. In addition, a detailed simulation study for PHARAOH-multi demonstrated advantages of multivariate analysis, compared to univariate analysis, and comparison studies showed the proposed approach to outperform existing multivariate pathway-based methods. Finally, we conducted an analysis of whole-exome sequencing data from a Korean population study to compare the performance of the proposed methods with that of previous pathway-based methods, using validated pathway databases. As a result, PHARAOH successfully discovered 13 pathways for the liver enzymes, and PHARAOH-multi identified 8 pathways for multiple metabolic traits. Through a replication study using an independent, large-scale exome chip dataset, we replicated many pathways that were discovered by the proposed methods and showed their biological relationship to the target traits.
Contents: Introduction (1.1. The background on genetic association studies; 1.1.1. Genome-wide association studies and the missing heritability; 1.1.2. Rare variant analyses; 1.2. The purpose of this study; 1.3. Outline of the thesis). An overview of existing methods (2.1. Review of pathway-based methods; 2.2.1. Competitive and self-contained tests: WKS and DRB; 2.2.2. Self-contained test: aSPU; 2.2.3. Self-contained test: MARV; 2.3. Generalized structured component analysis; 2.3.1. The model; 2.3.2. Parameter estimation). Pathway-based approach using rare variants (3.1. Introduction; 3.2. Methods; 3.2.1. Notations and the model; 3.2.2. An exemplary structure; 3.3.3. Parameter estimation; 3.4. Simulation study; 3.4.1. The simulation dataset; 3.4.2. Comparison of methods using simulation dataset; 3.5. Application to analysis of liver enzymes; 3.5.1. Whole exome sequencing dataset for pathway discovery; 3.5.2. Replication study using exome chip dataset; 3.6. Discussion). Multivariate pathway-based approach using rare variants (4.1. Introduction; 4.2. Methods; 4.2.1. Notations and the model; 4.2.2. An exemplary structure; 4.2.3. Parameter estimation; 4.2.4. Significance testing; 4.2.5. Multiple testing correction; 4.3. Simulation study; 4.3.1. The simulation model; 4.3.2. Evaluation with simulated data; 4.4. Application to the real datasets; 4.4.1. Real data discovery from whole-exome sequencing dataset; 4.4.2. Replication study using independent exome chip dataset; 4.5. Discussion). Summary & Conclusions. Bibliography. Abstract in Korean.