6,815 research outputs found

    A Quadratically Regularized Functional Canonical Correlation Analysis for Identifying the Global Structure of Pleiotropy with NGS Data

    Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information for a deep understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current multiple phenotype association analysis paradigm lacks breadth (the number of phenotypes and genetic variants jointly analyzed at the same time) and depth (the hierarchical structure of phenotypes and genotypes). A key issue for high dimensional pleiotropic analysis is to effectively extract informative internal representations and features from high dimensional genotype and phenotype data. To explore multiple levels of representations of genetic variants, learn their internal patterns involved in disease development, and overcome critical barriers to the development of novel statistical methods and computational algorithms for genetic pleiotropic analysis, we propose a new framework for association analysis, referred to as quadratically regularized functional CCA (QRFCCA), which combines three approaches: (1) quadratically regularized matrix factorization, (2) functional data analysis and (3) canonical correlation analysis (CCA). Large-scale simulations show that QRFCCA has much higher power than the nine competing statistics while retaining appropriate type I error rates. To further evaluate performance, QRFCCA and the nine other statistics are applied to the whole genome sequencing dataset from the TwinsUK study. Using QRFCCA, we identify a total of 79 genes with rare variants and 67 genes with common variants significantly associated with the 46 traits. The results show that QRFCCA substantially outperforms the nine other statistics. Comment: 64 pages including 12 figures

    Sparse Probit Linear Mixed Model

    Linear Mixed Models (LMMs) are important tools in statistical genetics. When used for feature selection, they make it possible to find a sparse set of genetic traits that best predict a continuous phenotype of interest, while simultaneously correcting for various confounding factors such as age, ethnicity and population structure. Formulated as models for linear regression, LMMs have been restricted to continuous phenotypes. We introduce the Sparse Probit Linear Mixed Model (Probit-LMM), which generalizes the LMM modeling paradigm to binary phenotypes. As a technical challenge, the model no longer possesses a closed-form likelihood function. In this paper, we present a scalable approximate inference algorithm that lets us fit the model to high-dimensional data sets. We show on three real-world examples from different domains that, in the setting of binary labels, our algorithm leads to better prediction accuracies and also selects features that show less correlation with the confounding factors. Comment: Published version, 21 pages, 6 figures
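    As a rough illustration of the probit link that the abstract above refers to, the following sketch evaluates a probit log-likelihood for binary labels given a linear predictor. It is not the paper's inference algorithm. The names `normal_cdf`, `probit_log_likelihood` and the `offsets` argument are hypothetical, and the confounder contribution is assumed to be known and fixed per sample. In a full probit LMM that contribution is random and has to be integrated out, which is the step that removes the closed form.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def probit_log_likelihood(weights, features, labels, offsets=None):
    """Log-likelihood of 0/1 labels under P(y = 1) = Phi(x . w + offset).

    Assumption: `offsets` holds a known per-sample confounder term.
    A real probit LMM treats this term as a random effect instead.
    """
    if offsets is None:
        offsets = [0.0] * len(labels)
    total = 0.0
    for x, y, o in zip(features, labels, offsets):
        eta = sum(w * xi for w, xi in zip(weights, x)) + o
        p = normal_cdf(eta)
        # Clamp to avoid log(0) for extreme linear predictors.
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [1, 0, 1]
    print(probit_log_likelihood([0.5, -0.5], X, y))
```

    A sparse fit would add a penalty such as an L1 norm on `weights` to this objective; that choice is also an assumption here, not a detail taken from the abstract.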

    Development of Analysis Algorithms for Family-Based Rare Variant Association Analysis

    Thesis (Ph.D.), Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2019. 8. Sungho Won.
    Despite tens of thousands of genome-wide association studies (GWASs), the so-called missing heritability reveals that analyses of common variants have identified only a limited number of disease susceptibility loci, and a substantial number of causal variants remain undiscovered by GWASs. Sequencing technology was expected to supply this additional information by obtaining large stretches of DNA spanning the entire genome, and improvements in this technology have enabled genetic association analysis of rare and common causal variants. However, the single variant association tests commonly used by GWASs result in false negative findings unless very large samples are available. Alternatively, aggregating association signals across multiple genetic variants in a biologically relevant region is expected to boost statistical power for rare variant analysis. Numerous statistical methods have been proposed for region-based rare variant association studies, such as burden, variance component, and combined omnibus tests.
    Region-based association tests are expected to substantially improve statistical power for rare variant analyses and to identify additional disease susceptibility loci. However, very few significant results have been identified, owing to genetic heterogeneity and relatively small sample sizes. To address these limitations, various approaches have been developed. First, family-based designs play an important role in controlling genetic heterogeneity and population stratification. Second, disease status is often diagnosed from the outcomes of different but related phenotypes, and thus multiple phenotype analysis is expected to provide additional information and increase power. Third, for the small sample issue, combining results from multiple studies using meta-analysis has repeatedly been shown to be an effective strategy. In this study, I compared the performance of a selection of popular family-based rare variant association tests and found that FARVAT is the most statistically robust and computationally efficient method. In addition, I extended FARVAT to multiple phenotype analysis (mFARVAT) and meta-analysis (metaFARVAT). mFARVAT is a quasi-likelihood-based score test for rare variant association analysis with multiple phenotypes, and it tests both homogeneous and heterogeneous effects of each variant on multiple phenotypes. metaFARVAT combines quasi-likelihood scores from multiple studies and generates burden, variable threshold, variance component, and combined omnibus test statistics. metaFARVAT tests homogeneous and heterogeneous genetic effects of variants among different studies and can be applied to both quantitative and dichotomous phenotypes.
    With extensive simulation studies under various scenarios, I found that the proposed methods are generally robust and efficient under different underlying genetic architectures, and I identified some promising candidate genes associated with chronic obstructive pulmonary disease (COPD), including DLEC1.
    Contents:
    1 Introduction: 1.1 The background on rare variant association studies (1.1.1 Overview of rare variant association studies; 1.1.2 Challenges of rare variant association studies); 1.2 Purpose of this study; 1.3 Outline of the thesis
    2 Overview of family-based rare variant association tests: 2.1 Overview of family-based association studies; 2.2 Comparison of the selected family-based rare variant association tests (2.2.1 Rare Variant Transmission Disequilibrium Test (RV-TDT); 2.2.2 Generalized Estimating Equations based Kernel Machine test (GEE-KM); 2.2.3 Combined Multivariate and Collapsing test for Pedigrees (PedCMC); 2.2.4 Gene-level kernel and burden tests for Pedigrees (PedGene); 2.2.5 FAmily-based Rare Variant Association Test (FARVAT); 2.2.6 Comparison of the methods with GAW19 data); 2.3 Conclusions
    3 Family-based Rare Variant Association Test for Multivariate Phenotypes: 3.1 Introduction; 3.2 Methods (3.2.1 Notations and the disease model; 3.2.2 Choice of offset; 3.2.3 Score for quasi-likelihood; 3.2.4 Homogeneous mFARVAT; 3.2.5 Heterogeneous mFARVAT); 3.3 Simulation study (3.3.1 The simulation model; 3.3.2 Evaluation of mFARVAT with simulated data); 3.4 Application to COPD data; 3.5 Discussion
    4 Family-based Rare Variant Association Test for Meta-analysis: 4.1 Introduction; 4.2 Methods (4.2.1 Notation; 4.2.2 Choices of Offset; 4.2.3 Score for Quasi-likelihood; 4.2.4 Homogeneous Model; 4.2.5 Heterogeneous Model); 4.3 Simulation study (4.3.1 The simulation model; 4.3.2 Evaluation of metaFARVAT with simulated data); 4.4 Application to COPD data; 4.5 Discussion
    5 Summary & Conclusions
    Bibliography
    Abstract (Korean)
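    To make the idea of a region-based burden test concrete, here is a minimal sketch that collapses the rare variants in a region into one score per individual and tests that score against a quantitative phenotype. It assumes unrelated individuals and therefore does not account for family structure, so it is not FARVAT, mFARVAT or metaFARVAT. The function names, the unweighted sum, and the `maf_cutoff` default of 0.01 are illustrative assumptions.

```python
import math


def burden_scores(genotypes, mafs, maf_cutoff=0.01):
    """Sum minor-allele counts over rare variants for each individual.

    genotypes: one row per individual, one 0/1/2 count per variant.
    mafs: minor allele frequency of each variant.
    """
    rare = [j for j, f in enumerate(mafs) if f < maf_cutoff]
    return [sum(row[j] for j in rare) for row in genotypes]


def burden_test(genotypes, mafs, phenotype, maf_cutoff=0.01):
    """Score-type test of the burden against a quantitative phenotype.

    Uses z = sqrt(n) * r, where r is the Pearson correlation, which is
    approximately standard normal under the null for unrelated samples.
    Returns (z, two-sided p-value).
    """
    b = burden_scores(genotypes, mafs, maf_cutoff)
    n = len(b)
    mean_b = sum(b) / n
    mean_y = sum(phenotype) / n
    sxy = sum((bi - mean_b) * (yi - mean_y) for bi, yi in zip(b, phenotype))
    sxx = sum((bi - mean_b) ** 2 for bi in b)
    syy = sum((yi - mean_y) ** 2 for yi in phenotype)
    if sxx == 0 or syy == 0:
        # No variation in the burden or the phenotype: nothing to test.
        return 0.0, 1.0
    z = math.sqrt(n) * sxy / math.sqrt(sxx * syy)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    G = [[0, 0, 1], [1, 0, 0], [0, 0, 2], [0, 1, 0]]
    maf = [0.005, 0.3, 0.002]
    print(burden_test(G, maf, [1.0, 1.0, 2.0, 0.0]))
```

    The burden statistic works best when the rare variants act in the same direction; variance component and omnibus tests, which the abstract also mentions, relax that assumption.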

    Multi view based imaging genetics analysis on Parkinson disease

    Longitudinal studies integrating imaging and genetic data have recently become widespread among bioinformatics researchers. Combining such heterogeneous data allows a better understanding of the origins and causes of complex diseases. Through a multi-view based workflow proposal, we show the common steps and tools used in imaging genetics analysis, interpolating genotyping, neuroimaging and transcriptomic data. We describe the advantages of existing methods to analyze heterogeneous datasets, using Parkinson's Disease (PD) as a case study. Parkinson's disease is associated with both genetic and neuroimaging factors; however, such imaging genetics associations are at an early stage of investigation. It is therefore desirable to have a free and open source workflow that integrates different analysis flows in order to recover potential genetic biomarkers in PD, as in other complex diseases.