6,815 research outputs found
A Quadratically Regularized Functional Canonical Correlation Analysis for Identifying the Global Structure of Pleiotropy with NGS Data
Investigating the pleiotropic effects of genetic variants can increase
statistical power, provide important information to achieve deep understanding
of the complex genetic structures of disease, and offer powerful tools for
designing effective treatments with fewer side effects. However, the current
multiple phenotype association analysis paradigm lacks breadth (number of
phenotypes and genetic variants jointly analyzed at the same time) and depth
(hierarchical structure of phenotype and genotypes). A key issue for high
dimensional pleiotropic analysis is to effectively extract informative internal
representation and features from high dimensional genotype and phenotype data.
To explore multiple levels of representations of genetic variants, learn their
internal patterns involved in the disease development, and overcome critical
barriers in advancing the development of novel statistical methods and
computational algorithms for genetic pleiotropic analysis, we proposed a new
framework referred to as a quadratically regularized functional CCA (QRFCCA)
for association analysis which combines three approaches: (1) quadratically
regularized matrix factorization, (2) functional data analysis and (3)
canonical correlation analysis (CCA). Large-scale simulations show that
QRFCCA has much higher power than the nine competing statistics while
maintaining appropriate type 1 error rates. To further evaluate performance, the
QRFCCA and nine other statistics are applied to the whole genome sequencing
dataset from the TwinsUK study. We identify a total of 79 genes with rare
variants and 67 genes with common variants significantly associated with the 46
traits using QRFCCA. The results show that the QRFCCA substantially outperforms
the nine other statistics.
Comment: 64 pages including 12 figures
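The canonical correlation step named in the abstract above can be illustrated with a minimal sketch. This is not the authors' QRFCCA implementation: it omits the matrix factorization and functional data analysis stages, and the `ridge` parameter is an assumed, generic quadratic penalty added only for numerical stability. The function name and arguments are hypothetical.

```python
import numpy as np

def first_canonical_correlation(X, Y, ridge=1e-3):
    """Largest canonical correlation between two data blocks.

    X (n x p) could hold genotype features and Y (n x q) phenotypes.
    `ridge` is an illustrative penalty, not the QRFCCA regularizer.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    # Whiten each block with its Cholesky factor: M = Lx^{-1} Sxy Ly^{-T}.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    # The singular values of M are the canonical correlations.
    return np.linalg.svd(M, compute_uv=False)[0]
```

A value near 1 indicates that some linear combination of the columns of `X` is strongly correlated with some linear combination of the columns of `Y`. An association test would additionally need a null distribution, which this sketch does not provide.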
Sparse Probit Linear Mixed Model
Linear Mixed Models (LMMs) are important tools in statistical genetics. When
used for feature selection, they make it possible to find a sparse set of genetic traits
that best predict a continuous phenotype of interest, while simultaneously
correcting for various confounding factors such as age, ethnicity and
population structure. Formulated as models for linear regression, LMMs have
been restricted to continuous phenotypes. We introduce the Sparse Probit Linear
Mixed Model (Probit-LMM), where we generalize the LMM modeling paradigm to
binary phenotypes. As a technical challenge, the model no longer possesses a
closed-form likelihood function. In this paper, we present a scalable
approximate inference algorithm that lets us fit the model to high-dimensional
data sets. We show on three real-world examples from different domains that in
the setup of binary labels, our algorithm leads to better prediction accuracies
and also selects features which show less correlation with the confounding
factors.
Comment: Published version, 21 pages, 6 figures
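To make the setting concrete, the following sketch simulates data of the kind such a model targets: binary labels obtained by thresholding a sparse linear signal plus correlated noise that carries confounding structure. The abstract does not spell out the model, so every detail here (the kernel built from hypothetical confounders `Z`, the variance parameters, the function name) is an assumption for illustration, and no inference algorithm is shown.

```python
import numpy as np

def simulate_probit_lmm(n=200, d=50, n_active=5,
                        noise_var=1.0, conf_var=1.0, seed=0):
    """Draw (X, y, w, K) from an assumed probit mixed-model setup."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))               # features, e.g. genetic markers
    w = np.zeros(d)
    w[:n_active] = rng.normal(size=n_active)  # sparse weight vector
    Z = rng.normal(size=(n, 3))               # hypothetical confounders
    K = Z @ Z.T / Z.shape[1]                  # similarity kernel among samples
    # Correlated noise: independent part plus confounder-induced part.
    cov = noise_var * np.eye(n) + conf_var * K
    eps = rng.multivariate_normal(np.zeros(n), cov)
    y = np.where(X @ w + eps >= 0, 1, -1)     # binary labels via thresholding
    return X, y, w, K
```

Because the noise is correlated across samples through `K`, the likelihood of `y` is an integral over a multivariate Gaussian orthant, which is consistent with the abstract's remark that no closed-form likelihood is available.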
Development of analysis algorithms for family-based rare variant association analysis
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2019. 8. Sungho Won.
Despite tens of thousands of genome-wide association studies (GWASs), the so-called missing heritability reveals that analyses of common variants have identified only a limited number of disease susceptibility loci, and a substantial number of causal variants remain undiscovered by GWASs. Sequencing technology was expected to supply this additional information by obtaining large stretches of DNA spanning the entire genome, and improvements in this technology have enabled genetic association analysis of rare and common causal variants. However, the single-variant association tests commonly used in GWASs produce false negative findings unless very large samples are available. Alternatively, aggregating association signals across multiple genetic variants in a biologically relevant region is expected to boost statistical power for rare variant analysis. Numerous statistical methods have been proposed for region-based rare variant association studies, such as burden, variance component, and combined omnibus tests.
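As a rough illustration of the burden idea mentioned above, the sketch below collapses the rare variants in a region into one count per individual and applies a one-degree-of-freedom score test for a quantitative trait. It assumes unrelated samples and no covariates, so it is a textbook simplification and not the family-based FARVAT statistic; the function name and the `maf_threshold` default are assumptions.

```python
import math
import numpy as np

def burden_score_test(G, y, maf_threshold=0.01):
    """Burden test on an n x m matrix G of minor-allele counts (0/1/2).

    Returns (statistic, p_value, number_of_rare_variants_used).
    Assumes unrelated individuals and a quantitative trait y.
    """
    maf = G.mean(axis=0) / 2.0
    rare = (maf > 0) & (maf < maf_threshold)
    burden = G[:, rare].sum(axis=1).astype(float)  # rare alleles per person
    b = burden - burden.mean()
    r = y - y.mean()
    var_b = b @ b
    if var_b == 0:                                 # no rare alleles observed
        return 0.0, 1.0, int(rare.sum())
    sigma2 = r @ r / (len(y) - 1)                  # trait variance under the null
    stat = (b @ r) ** 2 / (sigma2 * var_b)         # approx. chi-square, 1 df
    p_value = math.erfc(math.sqrt(stat / 2.0))     # chi-square(1) upper tail
    return float(stat), p_value, int(rare.sum())
```

Summing alleles presumes that the rare variants act in the same direction. Variance component tests relax that assumption, and omnibus tests combine both approaches.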
Region-based association tests are expected to substantially improve statistical power for rare variant analyses and to identify additional disease susceptibility loci. However, very few significant results have been identified, owing to genetic heterogeneity and relatively small sample sizes. To address these limitations, various approaches have been developed. First, family-based designs play an important role in controlling genetic heterogeneity and population stratification. Second, disease status is often diagnosed from the outcomes of different but related phenotypes, and thus multiple phenotype analysis is expected to provide additional information and increase power. Third, for the small sample issue, combining results from multiple studies using meta-analysis has repeatedly been shown to be an effective strategy.
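The third strategy, pooling evidence across studies, can be sketched with a generic fixed-effect combination of score statistics. This assumes each study reports a score `U_k` and its null variance `V_k` for the same test and that effects are homogeneous across studies; it is a standard construction and not the metaFARVAT statistic itself. The function name is hypothetical.

```python
import math

def combine_scores(scores, variances):
    """Fixed-effect meta-analysis of per-study score statistics.

    scores[k] is the score U_k from study k and variances[k] its null
    variance V_k. Under the null, (sum U_k)^2 / (sum V_k) is
    approximately chi-square with 1 degree of freedom.
    """
    total_u = sum(scores)
    total_v = sum(variances)
    stat = total_u ** 2 / total_v
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square(1) upper tail
    return stat, p_value

# Two small studies whose effects point in the same direction.
stat, p_value = combine_scores([3.0, 4.0], [4.0, 5.0])
```

In this toy input the individual studies give p-values of roughly 0.13 and 0.07, while the combined statistic gives roughly 0.02, which shows how pooling can recover power that small samples lack. When effects differ in sign between studies, the summed score can cancel, which is the motivation for heterogeneous-effect tests.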
In this study, I compared the performance of a selection of popular family-based rare variant association tests and found that FARVAT is the most statistically robust and computationally efficient method. In addition, I extended FARVAT to multiple phenotype analysis (mFARVAT) and meta-analysis (metaFARVAT). mFARVAT is a quasi-likelihood-based score test for rare variant association analysis with multiple phenotypes, and it tests both homogeneous and heterogeneous effects of each variant on multiple phenotypes. metaFARVAT combines quasi-likelihood scores from multiple studies and generates burden, variable threshold, variance component, and combined omnibus test statistics. metaFARVAT tests homogeneous and heterogeneous genetic effects of variants among different studies and can be applied to both quantitative and dichotomous phenotypes. With extensive simulation studies under various scenarios, I found that the proposed methods are generally robust and efficient under different underlying genetic architectures, and I identified some promising candidate genes associated with chronic obstructive pulmonary disease (COPD), including DLEC1.
Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 The background on rare variant association studies
1.1.1 Overview of rare variant association studies
1.1.2 Challenges of rare variant association studies
1.2 Purpose of this study
1.3 Outline of the thesis
2 Overview of family-based rare variant association tests
2.1 Overview of family-based association studies
2.2 Comparison of the selected family-based rare variant association tests
2.2.1 Rare Variant Transmission Disequilibrium Test (RV-TDT)
2.2.2 Generalized Estimating Equations based Kernel Machine test (GEE-KM)
2.2.3 Combined Multivariate and Collapsing test for Pedigrees (PedCMC)
2.2.4 Gene-level kernel and burden tests for Pedigrees (PedGene)
2.2.5 FAmily-based Rare Variant Association Test (FARVAT)
2.2.6 Comparison of the methods with GAW19 data
2.3 Conclusions
3 Family-based Rare Variant Association Test for Multivariate Phenotypes
3.1 Introduction
3.2 Methods
3.2.1 Notations and the disease model
3.2.2 Choice of offset
3.2.3 Score for quasi-likelihood
3.2.4 Homogeneous mFARVAT
3.2.5 Heterogeneous mFARVAT
3.3 Simulation study
3.3.1 The simulation model
3.3.2 Evaluation of mFARVAT with simulated data
3.4 Application to COPD data
3.5 Discussion
4 Family-based Rare Variant Association Test for Meta-analysis
4.1 Introduction
4.2 Methods
4.2.1 Notation
4.2.2 Choices of Offset
4.2.3 Score for Quasi-likelihood
4.2.4 Homogeneous Model
4.2.5 Heterogeneous Model
4.3 Simulation study
4.3.1 The simulation model
4.3.2 Evaluation of metaFARVAT with simulated data
4.4 Application to COPD data
4.5 Discussion
5 Summary & Conclusions
Bibliography
Abstract (Korean)
Doctor
Multi view based imaging genetics analysis on Parkinson disease
Longitudinal studies integrating imaging and genetic data have recently become widespread among bioinformatics researchers. Combining such heterogeneous data allows a better understanding of the origins and causes of complex diseases. Through a multi-view based workflow proposal, we show the common steps and tools used in imaging genetics analysis, interpolating genotyping, neuroimaging and transcriptomic data. We describe the advantages of existing methods for analyzing heterogeneous datasets, using Parkinson's Disease (PD) as a case study. Parkinson's disease is associated with both genetic and neuroimaging factors; however, such imaging genetics associations are at an early stage of investigation. It is therefore desirable to have a free and open source workflow that integrates different analysis flows in order to recover potential genetic biomarkers in PD, as in other complex diseases.
- …