2,194 research outputs found

    Multivariate Models and Algorithms for Systems Biology

    Rapid advances in high-throughput data acquisition technologies, such as microarrays and next-generation sequencing, have enabled scientists to interrogate the expression levels of tens of thousands of genes simultaneously. However, challenges remain in developing effective computational methods for analyzing data generated from such platforms. In this dissertation, we address some of these challenges. We divide our work into two parts. In the first part, we present a suite of multivariate approaches for the reliable discovery of gene clusters, often interpreted as pathway components, from molecular profiling data with replicated measurements. We translate our goal into learning an optimal correlation structure from replicated complete and incomplete measurements. In the second part, we focus on the reconstruction of signal transduction mechanisms in the signaling pathway components. We propose gene set based approaches for inferring the structure of a signaling pathway. First, we present a constrained multivariate Gaussian model, referred to as the informed-case model, for estimating the correlation structure from replicated and complete molecular profiling data. The informed-case model generalizes the previously known blind-case model by accommodating prior knowledge of replication mechanisms. Second, we generalize the blind-case model by designing a two-component mixture model. Our idea is to strike an optimal balance between a fully constrained correlation structure and an unconstrained one. Third, we develop an Expectation-Maximization algorithm to infer the underlying correlation structure from replicated molecular profiling data with missing (incomplete) measurements. We utilize our correlation estimators for clustering real-world replicated complete and incomplete molecular profiling data sets. The above three components constitute the first part of the dissertation.
For the structural inference of signaling pathways, we hypothesize a directed signal pathway structure as an ensemble of overlapping and linear signal transduction events. We then propose two algorithms to reverse engineer the underlying signaling pathway structure using unordered gene sets corresponding to signal transduction events. Throughout, we treat gene sets as variables and the associated gene orderings as random. The first algorithm has been developed under the Gibbs sampling framework and the second algorithm utilizes the framework of simulated annealing. Finally, we summarize our findings and discuss possible future directions.

    Statistical Methods for Analyzing Multivariate Phenotypes and Detecting Rare Variant Associations

    This dissertation includes four papers, one per chapter. In chapter 1, I compared the performance of eight multivariate phenotype association tests. The motivation for this power comparison is as follows. For nearly 15 years, genome-wide association studies (GWAS) have been widely used to identify genetic variants associated with human diseases and traits. GWAS typically investigate genetic variants for a predefined phenotype, and thus fail to identify weak but important effects. In recent years, many multivariate association tests have been developed. However, a comprehensive summary of such approaches is lacking. To fill this important gap, I carried out this power comparison. The results show that none of the methods is consistently more powerful than the others. More powerful methods are still in great demand. In chapter 2, I proposed a Weighted Combination of multiple Phenotypes approach (WCmulP) for testing multiple correlated phenotypes and one genetic variant of interest. WCmulP linearly combines the multiple phenotypes with optimal weights such that the score test statistic is maximized. I compare WCmulP with other widely used tests and conduct extensive simulation studies as well as real data analysis to evaluate the performance of these methods. The results show that WCmulP outperforms the compared methods in most of the simulation scenarios and in the real data analysis. With the availability of electronic health records (EHR), thousands of clinical phenotypes can be measured and collected systematically. As a result, phenome-wide association studies (PheWAS) emerged to detect variants associated with a broad spectrum of phenotypes. However, current PheWAS are intrinsically univariate tests, which investigate phenotypes one at a time. Genuine PheWAS that simultaneously test a wide range of phenotypes need to be developed.
In chapter 3, I proposed a novel PheWAS approach, referred to as PheCLC (PheWAS using clustering linear combination), to examine genetic variation associated with up to thousands of phenotypes. PheCLC jointly analyzes a wide spectrum of human phenotypes and classifies them into different categories based on the International Classification of Diseases (ICD) codes. The simulation results show that PheCLC controls type I error rates and is much more powerful than the traditional multivariate approaches. To date, GWAS have published thousands of common variants associated with human diseases. However, these common variants contribute only a small portion of the phenotypic variance. Many studies have shown that rare variants could substantially explain missing heritability. In chapter 4, I derived a rare variant association study for family-based designs, where the rare variants can be enriched compared to population-based designs. I applied the proposed method as well as two other family-based tests to the Genetic Analysis Workshop 19 (GAW19) dataset, and the results show that our method can identify more genes with power greater than 40% than the other two methods.

    Defining the genetic control of human blood plasma N-glycome using genome-wide association study

    Glycosylation is a common post-translational modification of proteins. Glycosylation is associated with a number of human diseases. Defining genetic factors altering glycosylation may provide a basis for novel approaches to diagnostic and pharmaceutical applications. Here we report a genome-wide association study of the human blood plasma N-glycome composition in up to 3811 people measured by Ultra Performance Liquid Chromatography (UPLC) technology. Starting with the 36 original traits measured by UPLC, we computed an additional 77 derived traits leading to a total of 113 glycan traits. We studied associations between these traits and genetic polymorphisms located on human autosomes. We discovered and replicated 12 loci. This allowed us to demonstrate an overlap in genetic control between total plasma protein and IgG glycosylation. The majority of revealed loci contained genes that encode enzymes directly involved in glycosylation (FUT3/FUT6, FUT8, B3GAT1, ST6GAL1, B4GALT1, ST3GAL4, MGAT3 and MGAT5) and a known regulator of plasma protein fucosylation (HNF1A). However, we also found loci that could possibly reflect other more complex aspects of the glycosylation process. Functional genomic annotation suggested the role of several genes including DERL3, CHCHD10, TMEM121, IGH and IKZF1. The hypotheses we generated may serve as a starting point for further functional studies in this research area.

    Gamma-based clustering via ordered means with application to gene-expression analysis

    Discrete mixture models provide a well-known basis for effective clustering algorithms, although technical challenges have limited their scope. In the context of gene-expression data analysis, a model is presented that mixes over a finite catalog of structures, each one representing equality and inequality constraints among latent expected values. Computations depend on the probability that independent gamma-distributed variables attain each of their possible orderings. Each ordering event is equivalent to an event in independent negative-binomial random variables, and this finding guides a dynamic-programming calculation. The structuring of mixture-model components according to constraints among latent means leads to strict concavity of the mixture log likelihood. In addition to its beneficial numerical properties, the clustering method shows promising results in an empirical study. Comment: Published at http://dx.doi.org/10.1214/10-AOS805 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    STATISTICAL METHODS FOR JOINT ANALYSIS OF MULTIPLE PHENOTYPES AND THEIR APPLICATIONS FOR PHEWAS

    Genome-wide association studies (GWAS) have successfully detected tens of thousands of robust SNP-trait associations. Earlier research has primarily focused on association studies of genetic variants and some well-defined functions or phenotypic traits. Emerging evidence suggests that pleiotropy, the phenomenon in which one genetic variant affects multiple phenotypes, is widespread, especially in complex human diseases. Therefore, individual phenotype analyses may lose statistical power to identify the underlying genetic mechanism. In contrast with single phenotype analyses, joint analysis of multiple phenotypes exploits the correlations between phenotypes and aggregates multiple weak marginal effects, and is therefore likely to provide new insights into the functional consequences of genetic variations. This dissertation includes two papers, corresponding to the two primary research projects of my Ph.D. study, one per chapter. Chapter 1 proposes an innovative method, referred to as HC-CLC, for joint analysis of multiple phenotypes using a Hierarchical Clustering (HC) approach followed by a Clustering Linear Combination (CLC) method. The HC step partitions phenotypes into clusters. The CLC method is then used to test the association between the genetic variant and all phenotypes, which is done by combining individual test statistics while taking full advantage of the clustering information from the HC step. Extensive simulations together with the COPDGene data analysis have been used to assess the Type I error rates and the power of our proposed method. Our simulation results demonstrate that the Type I error rates of HC-CLC are effectively controlled in different realistic settings. HC-CLC either outperforms all other methods or has statistical power that is very close to the most powerful alternative method with which it has been compared.
In addition, our real data analysis shows that HC-CLC is an appropriate method for GWAS. Chapter 2 redesigns PheCLC (a phenome-wide association study that uses the CLC method), which was previously developed by our research group. The refined method is then applied to the UK Biobank data, a large cohort study across the United Kingdom, to test the validity and understand the limitations of the proposed method. We have named our new method UKB-PheCLC. The UKB-PheCLC method is an EHR-based PheWAS. In the first step, it classifies the whole phenome into different phenotypic categories according to the UK Biobank ICD codes. In the second step, the CLC method is applied to each phenotypic category to derive a CLC-based p-value for testing the association between the genetic variant of interest and all phenotypes in that category. In the third step, the CLC-based p-values of all categories are combined using a strategy resembling that of the Adaptive Fisher's Combination (AFC) method. Overall, UKB-PheCLC harnesses the powerful resource of the UK Biobank and considers the possibility that phenotypes can be grouped into different phenotypic categories, which is very common in EHR-based PheWAS. Moreover, UKB-PheCLC can handle both qualitative and quantitative phenotypes, and it does not require raw phenotype information. The real data analysis results confirm that UKB-PheCLC is more powerful than the existing methods we compared it with. Thus, UKB-PheCLC can serve as a compelling method for phenome-wide association studies.

    Differentially expressed genes match bill morphology and plumage despite largely undifferentiated genomes in a Holarctic songbird

    © 2015 John Wiley & Sons Ltd. Understanding the patterns and processes that contribute to phenotypic diversity and speciation is a central goal of evolutionary biology. Recently, high-throughput sequencing has provided unprecedented phylogenetic resolution in many lineages that have experienced rapid diversification. The Holarctic redpoll finches (Genus: Acanthis) provide an intriguing example of a recent, phenotypically diverse lineage; traditional sequencing and genotyping methods have failed to detect any genetic differences between currently recognized species, despite marked variation in plumage and morphology within the genus. We examined variation among 20 712 anonymous single nucleotide polymorphisms (SNPs) distributed throughout the redpoll genome in combination with 215 825 SNPs within the redpoll transcriptome, gene expression data and ecological niche modelling to evaluate genetic and ecological differentiation among currently recognized species. Expanding upon previous findings, we present evidence of (i) largely undifferentiated genomes among currently recognized species; (ii) substantial niche overlap across the North American Acanthis range; and (iii) a strong relationship between polygenic patterns of gene expression and continuous phenotypic variation within a sample of redpolls from North America. The patterns we report may be caused by high levels of ongoing gene flow between polymorphic populations, incomplete lineage sorting accompanying very recent or ongoing divergence, variation in cis-regulatory elements, or phenotypic plasticity, but do not support a scenario of prolonged isolation and subsequent secondary contact. Together, these findings highlight ongoing theoretical and computational challenges presented by recent, rapid bouts of phenotypic diversification and provide new insight into the evolutionary dynamics of an intriguing, understudied non-model system. See also the Perspective by Lifjel

    Bayesian testing of many hypotheses × many genes: A study of sleep apnea

    Substantial statistical research has recently been devoted to the analysis of large-scale microarray experiments which provide a measure of the simultaneous expression of thousands of genes in a particular condition. A typical goal is the comparison of gene expression between two conditions (e.g., diseased vs. nondiseased) to detect genes which show differential expression. Classical hypothesis testing procedures have been applied to this problem and more recent work has employed sophisticated models that allow for the sharing of information across genes. However, many recent gene expression studies have an experimental design with several conditions that requires an even more involved hypothesis testing approach. In this paper, we use a hierarchical Bayesian model to address the situation where there are many hypotheses that must be simultaneously tested for each gene. In addition to having many hypotheses within each gene, our analysis also addresses the more typical multiple comparison issue of testing many genes simultaneously. We illustrate our approach with an application to a study of genes involved in obstructive sleep apnea in humans. Comment: Published at http://dx.doi.org/10.1214/09-AOAS241 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)