6,042 research outputs found

    Statistical analysis of large-scale genomic data using pathway information

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2018. 2. Taesung Park.
    In the past two decades, rapid advances in DNA sequencing technology have enabled extensive investigations into human genetic architecture, especially for the identification of genetic variants associated with complex traits. In particular, genome-wide association studies (GWAS) have played a key role in identifying genetic associations between single nucleotide variants (SNVs) and many complex biological pathologies. However, the genetic variants identified by many successful GWAS have explained only a modest part of the heritability of most phenotypes, and many hypotheses have been proposed to address the so-called missing heritability issue, such as rare variant association, gene-gene interaction, or multi-omics integration. Methods for rare variant analysis arose from extending individual variant-level approaches to the gene level, and then extending gene-level approaches to multiple phenotypes. As the number of publicly available biological resources increases, recent methods for analyzing rare variants use pathway knowledge as a priori information. In this respect, many statistical methods for pathway-based analyses using rare variants have been proposed to analyze pathways individually. However, neglecting correlations between multiple pathways can produce misleading results, and pathway-based analyses of large-scale genetic datasets impose a massive computational burden. Moreover, while a number of methods for pathway-based rare-variant analysis of multiple phenotypes have been proposed, none considers a unified model that incorporates multiple pathways. In this thesis, we propose novel statistical methods to analyze large-scale genetic datasets using pathway information: Pathway-based approach using HierArchical components of collapsed RAre variants Of High-throughput sequencing data (PHARAOH) and PHARAOH-multi. PHARAOH extends generalized structured component analysis and is implemented within the framework of generalized linear models to accommodate phenotype data arising from a variety of exponential family distributions. PHARAOH constructs a single hierarchical model that consists of collapsed gene-level summaries and pathways, and analyzes entire pathways simultaneously by imposing ridge-type penalties on both gene and pathway coefficient estimates; hence our method considers the correlation of pathways and handles an entire dataset in a single model. In addition, PHARAOH-multi extends the original model to multivariate analysis while keeping the advantages of our previous approach. We extend PHARAOH to enable analysis of multiple traits using hierarchical components of genetic variants. PHARAOH-multi can identify associations between multiple phenotypes and multiple pathways in a single model that treats the genes within each pathway as a hierarchy. Through simulation studies, PHARAOH was shown to have higher statistical power than the existing pathway-based methods. In addition, a detailed simulation study for PHARAOH-multi demonstrated the advantages of multivariate analysis over univariate analysis, and comparison studies showed that the proposed approach outperforms existing multivariate pathway-based methods. Finally, we analyzed whole-exome sequencing data from a Korean population study to compare the performance of the proposed methods with that of previous pathway-based methods, using validated pathway databases. As a result, PHARAOH discovered 13 pathways for the liver enzymes, and PHARAOH-multi identified 8 pathways for multiple metabolic traits. Through a replication study using an independent, large-scale exome chip dataset, we replicated many pathways that were discovered by the proposed methods and showed their biological relationship to the target traits.
    Contents:
    Introduction: 1.1. The background on genetic association studies (1.1.1. Genome-wide association studies and the missing heritability; 1.1.2. Rare variant analyses); 1.2. The purpose of this study; 1.3. Outline of the thesis
    An overview of existing methods: 2.1. Review of pathway-based methods (2.2.1. Competitive and self-contained tests: WKS and DRB; 2.2.2. Self-contained test: aSPU; 2.2.3. Self-contained test: MARV); 2.3. Generalized structured component analysis (2.3.1. The model; 2.3.2. Parameter estimation)
    Pathway-based approach using rare variants: 3.1. Introduction; 3.2. Methods (3.2.1. Notations and the model; 3.2.2. An exemplary structure; 3.3.3. Parameter estimation); 3.4. Simulation study (3.4.1. The simulation dataset; 3.4.2. Comparison of methods using simulation dataset); 3.5. Application to analysis of liver enzymes (3.5.1. Whole exome sequencing dataset for pathway discovery; 3.5.2. Replication study using exome chip dataset); 3.6. Discussion
    Multivariate pathway-based approach using rare variants: 4.1. Introduction; 4.2. Methods (4.2.1. Notations and the model; 4.2.2. An exemplary structure; 4.2.3. Parameter estimation; 4.2.4. Significance testing; 4.2.5. Multiple testing correction); 4.3. Simulation study (4.3.1. The simulation model; 4.3.2. Evaluation with simulated data); 4.4. Application to the real datasets (4.4.1. Real data discovery from whole-exome sequencing dataset; 4.4.2. Replication study using independent exome chip dataset); 4.5. Discussion
    Summary & Conclusions
    Bibliography
    Abstract (in Korean)
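The "collapsed gene-level summaries" mentioned in the abstract above can be illustrated with a simple burden-style collapse. The following is only a minimal sketch under stated assumptions, not the PHARAOH implementation: the function name `collapse_genes`, the unweighted sum, and the `maf_threshold` parameter are all hypothetical choices made for illustration, and the ridge-penalized hierarchical model itself is not reproduced here.

```python
from typing import Dict, List, Sequence


def collapse_genes(
    genotypes: Sequence[Sequence[int]],
    gene_to_variants: Dict[str, List[int]],
    maf_threshold: float = 0.01,  # hypothetical cutoff for calling a variant "rare"
) -> Dict[str, List[float]]:
    """Collapse rare variants into one burden score per gene and sample.

    genotypes[i][j] is the minor-allele count (0, 1 or 2) of sample i at variant j.
    gene_to_variants maps a gene name to the column indices of its variants.
    """
    n_samples = len(genotypes)
    n_variants = len(genotypes[0]) if n_samples else 0

    # Estimate each variant's minor allele frequency from the samples themselves.
    maf = [
        sum(row[j] for row in genotypes) / (2 * n_samples)
        for j in range(n_variants)
    ]

    scores: Dict[str, List[float]] = {}
    for gene, variants in gene_to_variants.items():
        # Keep only variants that are observed and at or below the threshold.
        rare = [j for j in variants if 0 < maf[j] <= maf_threshold]
        # Unweighted burden: total number of rare minor alleles in the gene.
        scores[gene] = [float(sum(row[j] for j in rare)) for row in genotypes]
    return scores
```

In a pathway-level model, per-gene scores of this kind would then serve as the inputs that are combined into pathway components; weighting schemes other than the plain sum are common and equally compatible with this sketch.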

    Pathway Analysis Approaches for Rare and Common Variants: Insights From Genetic Analysis Workshop 18

    Pathway analysis, broadly defined as a group of methods incorporating a priori biological information from public databases, has emerged as a promising approach for analyzing high-dimensional genomic data. As part of Genetic Analysis Workshop 18, seven research groups applied pathway analysis techniques to whole-genome sequence data from the San Antonio Family Study. Overall, the groups found that the potential of pathway analysis to improve detection of causal variants by lowering the multiple-testing burden and incorporating biologic insight remains largely unrealized. Specifically, there is a lack of best practices at each stage of the pathway approach: annotation, analysis, interpretation, and follow-up. Annotation of genetic variants is inconsistent across databases, incomplete, and biased toward known genes. At the analysis stage insufficient statistical power remains a major challenge. Analyses combining rare and common variants may have an inflated type I error rate and may not improve detection of causal genes. Inclusion of known causal genes may not improve statistical power, although the fraction of explained phenotypic variance may be a more appropriate metric. Interpretation of findings is further complicated by evidence in support of interactions between pathways and by the lack of consensus on how to best incorporate functional information. Finally, all presented approaches warranted follow-up studies, both to reduce the likelihood of false-positive findings and to identify specific causal variants within a given pathway. Despite the initial promise of pathway analysis for modeling biological complexity of disease phenotypes, many methodological challenges currently remain to be addressed

    Statistical methods for gene selection and genetic association studies

    This dissertation includes five Chapters. A brief description of each chapter is organized as follows. In Chapter One, we propose a signed bipartite genotype and phenotype network (GPN) by linking phenotypes and genotypes based on the statistical associations. It provides a new insight to investigate the genetic architecture among multiple correlated phenotypes and explore where phenotypes might be related at a higher level of cellular and organismal organization. We show that multiple phenotypes association studies by considering the proposed network are improved by incorporating the genetic information into the phenotype clustering. In Chapter Two, we first illustrate the proposed GPN to GWAS summary statistics. Then, we assess contributions to constructing a well-defined GPN with a clear representation of genetic associations by comparing the network properties with a random network, including connectivity, centrality, and community structure. The network topology annotations based on the sparse representations of GPN can be used to understand the disease heritability for the highly correlated phenotypes. In applications of phenome-wide association studies, the proposed GPN can identify more significant pairs of genetic variant and phenotype categories. In Chapter Three, a powerful and computationally efficient gene-based association test is proposed, aggregating information from different gene-based association tests and also incorporating expression quantitative trait locus information. We show that the proposed method controls the type I error rates very well and has higher power in the simulation studies and can identify more significant genes in the real data analyses. In Chapter Four, we develop six statistical selection methods based on the penalized regression for inferring target genes of a transcription factor (TF). 
In this study, the proposed selection methods combine statistics, machine learning, and convex optimization approaches, and are highly effective in identifying the true target genes. The methods fill the gap left by the lack of appropriate methods for predicting target genes of a TF, and are instrumental for validating experimental results from ChIP-seq and DAP-seq and, conversely, for selecting and annotating TFs based on their target genes. In Chapter Five, we propose a gene selection approach that captures gene-level signals through network-based regression in case-control association studies with DNA sequence data or DNA methylation data, inspired by the popular gene-based association tests that use a weighted combination of genetic variants to capture the combined effect of individual genetic variants within a gene. We show that the proposed gene selection approach has a higher true positive rate than traditional dimension reduction techniques in the simulation studies and selects potentially rheumatoid arthritis-related genes that are missed by existing methods.

    A clustering linear combination method for multiple phenotype association studies based on GWAS summary statistics

    There is strong evidence that joint analysis of multiple phenotypes in genome-wide association studies (GWAS) can increase statistical power when detecting associations between genetic variants and human complex diseases. We previously developed the Clustering Linear Combination (CLC) method and a computationally efficient CLC (ceCLC) method to test the association between multiple phenotypes and a genetic variant, both of which perform very well. However, both methods require individual-level genotypes and phenotypes, which are often not easily accessible. In this research, we develop a novel method called sCLC for association studies of multiple phenotypes and a genetic variant based on GWAS summary statistics. We use LD score regression to estimate the correlation matrix among phenotypes. The test statistic of sCLC is constructed from GWAS summary statistics and has an approximate Cauchy distribution. We perform a variety of simulation studies and compare sCLC with other commonly used methods for multiple phenotype association studies using GWAS summary statistics. Simulation results show that sCLC controls Type I error rates well and has the highest power in most scenarios. Moreover, we apply the newly developed method to the UK Biobank GWAS summary statistics from the XIII category with 70 related musculoskeletal system and connective tissue phenotypes. The results demonstrate that sCLC detects the largest number of significant SNPs, and most of these identified SNPs can be matched to genes that have been reported in the GWAS catalog to be associated with those phenotypes. Furthermore, sCLC also identifies some novel signals that were missed by standard GWAS, which provide new insight into the potential genetic factors of the musculoskeletal system and connective tissue phenotypes.
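A test statistic with an approximate Cauchy distribution, as described above, is characteristic of Cauchy combination tests. The sketch below shows the generic Cauchy combination of p-values and is not the sCLC statistic itself: the function name `cauchy_combine` and the default equal weights are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def cauchy_combine(
    pvalues: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine p-values with the generic Cauchy combination rule.

    Each p-value is mapped to a standard Cauchy variate, the variates are
    averaged with non-negative weights, and the average is converted back
    to a p-value using the standard Cauchy tail probability.
    """
    if not pvalues:
        raise ValueError("at least one p-value is required")
    if any(not 0.0 < p < 1.0 for p in pvalues):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(pvalues)  # equal weights by default (assumption)
    if len(weights) != len(pvalues):
        raise ValueError("weights and pvalues must have the same length")

    total = sum(weights)
    # Weighted average of the Cauchy-transformed p-values.
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)) / total
    # Upper-tail probability of a standard Cauchy variate at t.
    return 0.5 - math.atan(t) / math.pi
```

A single p-value is returned unchanged, for example `cauchy_combine([0.03])` gives approximately 0.03, which is a convenient sanity check. One commonly cited reason for using this form is that the tail approximation is fairly insensitive to correlation among the individual tests.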

    A Quadratically Regularized Functional Canonical Correlation Analysis for Identifying the Global Structure of Pleiotropy with NGS Data

    Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information for understanding the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current multiple phenotype association analysis paradigm lacks breadth (the number of phenotypes and genetic variants jointly analyzed at the same time) and depth (the hierarchical structure of phenotypes and genotypes). A key issue for high-dimensional pleiotropic analysis is to effectively extract informative internal representations and features from high-dimensional genotype and phenotype data. To explore multiple levels of representation of genetic variants, learn their internal patterns involved in disease development, and overcome critical barriers to the development of novel statistical methods and computational algorithms for genetic pleiotropic analysis, we propose a new framework for association analysis, referred to as quadratically regularized functional CCA (QRFCCA), which combines three approaches: (1) quadratically regularized matrix factorization, (2) functional data analysis, and (3) canonical correlation analysis (CCA). Large-scale simulations show that QRFCCA has much higher power than the nine competing statistics while retaining appropriate type I error rates. To further evaluate performance, QRFCCA and the nine other statistics are applied to the whole genome sequencing dataset from the TwinsUK study. We identify a total of 79 genes with rare variants and 67 genes with common variants significantly associated with the 46 traits using QRFCCA. The results show that QRFCCA substantially outperforms the nine other statistics. Comment: 64 pages, including 12 figures.

    STATISTICAL METHODS FOR GWAS AND THE IMPACT OF DIABETIC MEDICATION ADHERENCE ON HEALTHCARE COSTS

    This dissertation includes three Chapters. In Chapter One, we develop a computationally efficient clustering linear combination approach to jointly analyze multiple phenotypes for GWAS. In this paper, based on the existing CLC method and ACAT strategy, we develop the ceCLC method to test association between multiple phenotypes and a genetic variant. In Chapter Two, we develop a novel method called sCLC for association studies of multiple phenotypes and a genetic variant based on GWAS summary statistics. Simulation results show that sCLC can control Type I error rates well and has the highest power in most scenarios. In Chapter Three, we investigate the relationship between health service costs (medical cost, pharmacy cost, and total cost) and diabetic medication adherence for patients with diabetes in the UPHP population. This finding indicates that despite higher pharmacy spending, increasing medication adherence can significantly reduce the medical cost. Moreover, medication adherence based on different medicines has different effects on total healthcare cost and medical cost

    JOINT ANALYSIS OF MULTIPLE PHENOTYPES IN ASSOCIATION STUDIES

    Genome-wide association studies (GWAS) have become a very effective research tool for identifying genetic variants underlying various complex diseases. In spite of the success of GWAS in identifying thousands of reproducible associations between genetic variants and complex diseases, the association between genetic variants and a single phenotype is usually weak. It is increasingly recognized that joint analysis of multiple phenotypes can be more powerful than univariate analysis and can shed new light on the underlying biological mechanisms of complex diseases. Therefore, developing statistical methods to test for genetic association with multiple phenotypes has become increasingly important. This dissertation contains three chapters, each presenting a new method we developed for jointly analyzing multiple phenotypes. In the first chapter, we propose an Adaptive Fisher's Combination (AFC) method for joint analysis of multiple phenotypes in association studies. The AFC method combines p-values obtained in standard univariate GWAS, using an optimal number of p-values that is determined by the data. In the second chapter, we propose an Allele-Based Clustering (ABC) approach for the joint analysis of multiple non-normal phenotypes in association studies. In the ABC method, we treat the alleles at a SNP of interest as a dependent variable with two classes, and use the correlated phenotypes as predictors of the alleles at that SNP. In the third chapter, we develop a novel variable reduction method using a hierarchical clustering method (HCM) for joint analysis of multiple phenotypes in association studies. HCM involves two steps. The first step applies a dimension reduction technique by using a representative phenotype for each cluster of phenotypes. 
Then, existing methods are used in the second step to test the association between genetic variants and the representative phenotypes rather than the individual phenotypes. We perform extensive simulations to evaluate the performance of the AFC, ABC, and HCM methods and compare their power with that of some existing methods. Our simulation studies show that the proposed methods have correct type I error rates and are either the most powerful test or comparable with the most powerful test. Finally, we illustrate the proposed AFC and HCM methodologies by analyzing whole-genome genotyping data from a lung function study. The results of the real data analysis demonstrate that the proposed methods have great potential in GWAS on complex diseases with multiple phenotypes.
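The AFC method described above builds on Fisher's combination of p-values. The sketch below implements only the classical, non-adaptive Fisher combination, which assumes independent p-values; it is not the AFC method, which selects the number of p-values from the data and must cope with correlated phenotypes. The function name `fisher_combine` is a hypothetical choice for illustration.

```python
import math
from typing import Sequence


def fisher_combine(pvalues: Sequence[float]) -> float:
    """Classical Fisher combination of independent p-values.

    The statistic -2 * sum(log p_i) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis, where k = len(pvalues).
    For even degrees of freedom the tail probability has a closed form, so
    only the standard library is needed.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    half_stat = -sum(math.log(p) for p in pvalues)  # equals the statistic / 2

    # P(chi2_{2k} > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    tail = 1.0
    for i in range(1, k):
        term *= half_stat / i
        tail += term
    return math.exp(-half_stat) * tail
```

With a single input the function returns that p-value unchanged, and for two p-values of 0.5 it returns 0.25 * (1 + ln 4), which matches the closed form for k = 2. An adaptive variant would typically sort the p-values, evaluate the combination over the smallest ones for each candidate count, and calibrate the minimum by permutation; that step is deliberately left out here.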

    Pathway analysis of rare variants for the clustered phenotypes by using hierarchical structured components analysis

    Background: Recent large-scale genetic studies often involve clustered phenotypes such as repeated measurements. Compared to a series of univariate analyses of single phenotypes, an analysis of clustered phenotypes can substantially increase the statistical power to detect genetic associations. Moreover, for the analysis of rare variants, incorporating biological information can boost the weak effects of the rare variants. Results: Through simulation studies, we showed that the proposed method outperforms other methods currently available for pathway-level analysis of clustered phenotypes. Moreover, a real data analysis using a large-scale whole exome sequencing dataset of 995 samples with metabolic syndrome-related phenotypes identified the glyoxylate and dicarboxylate metabolism pathway, which could not be identified by univariate analyses of single phenotypes or by other existing methods. Conclusion: In this paper, we introduced a novel pathway-level association test that combines hierarchical structured components analysis and penalized generalized estimating equations. The proposed method analyzes all pathways in a single unified model while considering their correlations. A C/C++ implementation of PHARAOH-GEE is publicly available at http://statgen.snu.ac.kr/software/pharaoh-gee/. Publication costs are funded by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI) grant (HI16C2037). This work was also supported by the Bio & Medical Technology Development Program of the National Research Foundation of Korea (NRF) grant (2013M3A9C4078158) and by grants of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant numbers: HI16C2037, HI15C2165, HI16C2048).