837 research outputs found

    JBASE: Joint Bayesian Analysis of Subphenotypes and Epistasis

    Motivation: Rapid advances in genotyping and genome-wide association studies have enabled the discovery of many new genotype–phenotype associations at the resolution of individual markers. However, these associations explain only a small proportion of the theoretically estimated heritability of most diseases. In this work, we propose an integrative mixture model called JBASE: joint Bayesian analysis of subphenotypes and epistasis. JBASE explores two major reasons for missing heritability: interactions between genetic variants, a phenomenon known as epistasis, and phenotypic heterogeneity, addressed via subphenotyping. Results: Our extensive simulations in a wide range of scenarios repeatedly demonstrate that JBASE can identify true underlying subphenotypes, including their associated variants and their interactions, with high precision. In the presence of phenotypic heterogeneity, JBASE has higher power and lower type 1 error than five state-of-the-art approaches. We applied our method to a sample of individuals from Mexico with Type 2 diabetes and discovered two novel epistatic modules, including two loci each, that define two subphenotypes characterized by differences in body mass index and waist-to-hip ratio. We successfully replicated these subphenotypes and epistatic modules in an independent dataset from Mexico genotyped with a different platform. Availability and implementation: JBASE is implemented in C++, supported on Linux and is available at http://www.cs.toronto.edu/goldenberg/JBASE/jbase.tar.gz. The genotype data underlying this study are available upon approval by the ethics review board of the Medical Centre Siglo XXI.

    High-throughput computational methods and software for quantitative trait locus (QTL) mapping

    In recent years many new technologies, such as tiling arrays and high-throughput sequencing, have come to play an important role in systems genetics research. For researchers it is of the utmost importance to understand how this affects their research. This work describes possible solutions to the 'Big Data' avalanche that has hit systems genetics. This thesis describes the work carried out during the author's four-year PhD project at the Groningen Bioinformatics Centre to develop smarter and more optimized algorithms such as Pheno2Geno and MQM, and to use a collaborative approach such as xQTL workbench to store and analyse high-throughput systems genetics data.

    A GPU program to compute SNP-SNP interactions in genome-wide association studies

    With recent advances in next-generation sequencing technologies, short read sequences of the human genome have become more accessible. Paired-end sequencing of short reads is currently the most sensitive method for detecting somatic mutations that arise during tumor development. In this study, a novel approach to optimize the detection of structural variants using a new short read alignment program is presented. Pairwise interaction effects of Single Nucleotide Polymorphisms (SNPs) have been shown to help uncover the traits underlying complex diseases. Computing the disease risk based on the interaction effects of SNPs in a case-control study is a difficult problem. As another part of the thesis, a fast GPU program that can calculate the chi-square statistics of SNP-SNP interactions and output the significant interacting SNPs is presented. The algorithm is applied to the datasets of seven common diseases obtained from the Wellcome Trust Case Control Consortium (WTCCC). The algorithm computed the significant SNP pairs much faster than the existing algorithms and also identified 3 significant pairs associated with the genes IL23R and C11orf30, which are associated with pathogenesis, in the Crohn's disease dataset.
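As a rough illustration of the kind of statistic mentioned in the abstract above, the sketch below computes a Pearson chi-square for a single SNP pair on the CPU. It assumes genotypes coded 0/1/2 and a 9x2 table of joint genotype against case/control status; the thesis's exact test, encoding and GPU layout are not given in the abstract, so the function name `pair_chi_square` and all details here are hypothetical.

```python
import numpy as np

def pair_chi_square(snp_a, snp_b, is_case):
    """Pearson chi-square for a 9x2 table: joint genotype (3x3) vs. case/control.

    Assumption: genotypes are coded 0, 1, 2 and is_case is 0 (control) or 1 (case).
    """
    snp_a = np.asarray(snp_a, dtype=int)
    snp_b = np.asarray(snp_b, dtype=int)
    is_case = np.asarray(is_case, dtype=int)

    combo = 3 * snp_a + snp_b                  # joint genotype index, 0..8
    table = np.zeros((9, 2))
    np.add.at(table, (combo, is_case), 1)      # count samples per cell

    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()

    mask = expected > 0                        # ignore genotype combinations never observed
    diff2 = (table - expected) ** 2
    return float((diff2[mask] / expected[mask]).sum())
```

A genome-wide scan would evaluate this for every SNP pair, which is the part a GPU implementation would parallelize; the significance threshold and multiple-testing correction are left out of this sketch.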

    Empirical Investigations of RNA Fitness Landscapes: Harnessing the Power of High-Throughput Sequencing and Evolutionary Simulations

    Fitness landscapes or adaptive landscapes represent the mapping of genotype (sequence) to phenotype (function or fitness). Originally proposed as a metaphor to envision evolutionary processes and mutational interactions, the fitness landscape has recently transitioned from theoretical to empirical. This is due in part to advances in DNA synthesis and high-throughput sequencing. This allows for the construction and analysis of empirical fitness landscapes that encompass thousands of genotypes. These landscapes provide tractable insight into mutational pathways, the predictability of evolution or even the evolution of life. RNA enzymes (ribozymes) are an attractive model system for the construction of empirical fitness landscapes. Ribozymes function as both a genotype (primary RNA sequence) and a phenotype (catalytic function). To construct and characterize empirical RNA fitness landscapes, two high-throughput functional assays (self-cleavage and self-ligation), including a technique to improve data recovery from high-throughput sequencing using phased nucleotide inserts (Appendix A), were developed and implemented. Following fitness landscape construction, a stochastic evolutionary model was developed and employed based on the Wright-Fisher model. This model follows the principles of Darwinian evolution and allows a population to explore the fitness landscape by means of mutation and selection. These newly developed tools allowed for a novel approach to important evolutionary questions. Chapter 1 explored the evolution of innovation at the intersection of two ribozyme functions: self-cleavage and self-ligation. Evolutionary innovations are qualitatively novel traits that emerge through evolution. Theories have suggested that innovations can occur where two genotype networks are in close proximity. However, only isolated examples of intersections have been investigated. 
The fitness landscape between the two ribozyme functions was explored by determining the ability of numerous neighboring RNA sequences to catalyze two different chemical reactions. This revealed that there was extensive functional overlap, and over half the genotypes can catalyze both functions to some extent. Data-driven evolutionary simulations found that these numerous points of intersection facilitated the discovery of a new function, yet the rate of optimization depended upon the starting location in the genotype network. This study constructed a fitness landscape where genotype networks intersect and uncovered the implications for evolutionary innovations. Chapter 2 determined the effect of higher sequence space complexity and dimensionality on evolutionary adaptation in RNA fitness landscapes. The complexity and dimensionality of landscapes scale with the length of the RNA molecule. For this study, complexity was defined as the size of the genotype space and dimensionality as the number of edges connecting each genotype (node) to other genotypes that differ by a single mutation. Low-dimensional ‘direct’ landscapes consisting of only two possible nucleotides at various positions were compared to higher-dimensional ‘indirect’ landscapes that had all four nucleotides at the same positions. Indirect pathways contributed to the ruggedness and navigability of landscapes. Increased dimensionality in RNA fitness landscapes had the potential to circumvent fitness valleys; however, indirect pathways also harbored stasis genotypes isolated by reciprocal sign epistasis. Chapter 3 applied ancestral sequence resurrection and fitness landscape construction to naturally evolved ribozymes. The CPEB3 ribozyme is highly conserved in mammals and has been linked to episodic memory. By predicting, ‘resurrecting’ and functionally characterizing ancient gene sequences, hypotheses about gene function or selection can be empirically tested in an evolutionary context. 
Using the extant ribozyme sequences found in a range of mammalian species as a basis for inference of ancestral sequences, a phylogenetic fitness landscape was experimentally resurrected and reconstructed. A single high-activity ancestral sequence was found to be highly conserved and purifying selection is expected to have reduced the accumulation of mutations through geologic time. Many of the extant mammalian ribozyme sequences had high ribozyme activity; however, a few had relatively low activity. Yet, given the local fitness landscape, a selective pressure for functional ribozyme sequences was seen. A single nucleotide polymorphism (SNP) found in humans reduced co-transcriptional ribozyme activity in vitro and might alter our understanding of the CPEB3 ribozyme’s biological function. Chapter 4 analyzed epistatic interactions in four published RNA fitness landscapes generated from high-throughput analyses. Two of the landscapes were assessed in vivo and two were assessed in vitro. Epistasis occurs when the effects of some mutations are dependent on the presence or absence of other mutations. The data allowed for an analysis of the distribution of fitness effects of individual mutations as well as combinations of two or more mutations. Two different approaches to measuring epistasis in the data both revealed a predominance of negative epistasis, such that higher combinations of two or more mutations are typically lower in fitness than expected from the effect of each individual mutation. This finding differed from studies using computationally predicted RNA but is similar to mutational experiments in protein enzymes. The work presented here represents a significant contribution to our ability to construct and empirically characterize RNA fitness landscapes. The development of two high-throughput ribozyme assays opens the door for further empirical landscape construction. 
The implementation of data-driven stochastic evolutionary modeling allows for a clearer evolutionary characterization of the landscape. Understanding the connection between genotype and phenotype in RNA systems is important for designing RNA functions, improving in vitro selections and understanding the origins and evolution of new RNA functions (innovations). Applying these advances yielded valuable information about evolutionary innovations, the effects of higher dimensionality, evolution of extant ribozymes and the prevalence of epistasis in RNA fitness landscapes. Construction and analysis of empirical RNA fitness landscapes provides tractable insight into evolutionary processes, mutational pathways and the predictability of evolution
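The definition of epistasis given in this abstract (a double mutant's fitness differing from what the single-mutant effects predict) can be made concrete with a small sketch. It assumes a multiplicative null model on fitness relative to wild type, which is one common convention; the abstract does not say which two measures the thesis actually used, so the function `pairwise_epistasis` and its parameters are hypothetical.

```python
import math

def pairwise_epistasis(w_a, w_b, w_ab, w_wt=1.0):
    """Epistasis under a multiplicative null (an assumed convention).

    Returns log relative fitness of the double mutant minus the sum of the
    single-mutant log relative fitnesses. Negative values mean the double
    mutant is less fit than the single-mutant effects predict.
    """
    log_a = math.log(w_a / w_wt)
    log_b = math.log(w_b / w_wt)
    log_ab = math.log(w_ab / w_wt)
    return log_ab - (log_a + log_b)
```

For example, single mutants with relative fitness 0.8 and 0.5 predict a double mutant at 0.4 under this null; an observed double-mutant fitness of 0.2 would give a negative value, the pattern the abstract reports as predominant.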

    A phylogenetic method to perform genome-wide association studies in microbes

    Genome-Wide Association Studies (GWAS) are designed to perform an unbiased search of genetic sequence data with the intent of identifying statistically significant associations with a phenotype or trait of interest. The application of GWAS methods to microbial organisms promises to improve the way we understand, manage, and treat infectious diseases. Yet, while microbial pathogens continue to undermine human health, wealth, and longevity, microbial GWAS methods remain unable to fully capitalise on the growing wealth of bacterial and viral genetic sequence data. Clonal population structure and homologous recombination in microbial organisms make it difficult for existing GWAS methods to achieve both the precision needed to reject false positive findings and the statistical power required to detect genuine associations between microbial genotypic and phenotypic variants. In this thesis, we investigate potential solutions to the most substantial methodological challenges in microbial GWAS, and we introduce a new phylogenetic GWAS approach that has been specifically designed for use in bacterial samples. In presenting our approach, we describe the features that render it robust to the confounding effects of both population structure and recombination, while maintaining high statistical power to detect associations. Our approach is applicable to organisms ranging from purely clonal to frequently recombining, to sequence data from both the core and accessory genome, and to binary, categorical, and continuous phenotypes. We also describe the efforts taken to make our method efficient, scalable, and accessible in its implementation within the open-source R package we have created, called treeWAS. Next, we apply our GWAS method to simulated datasets. We develop multiple frameworks for simulating genotypic and phenotypic data with control over relevant parameters. 
We then present the results of our simulation study, and we use thorough performance testing to demonstrate the power and specificity of our approach, as compared to the performance of alternative cluster-based and dimension-reduction methods. Our approach is then applied to three empirical datasets, from Neisseria gonorrhoeae and Neisseria meningitidis, where we identify core SNPs associated with binary drug resistance and continuous antibiotic minimum inhibitory concentration phenotypes, as well as both core SNP and accessory genome associations with invasive and commensal phenotypes. These applications illustrate the versatility and potential of our method, demonstrating in each case that our approach is capable of confirming known resistance- or virulence-associated loci and discovering novel associations. Our thesis concludes with a review of the previous chapters and an evaluation of the strengths and limitations displayed by the current implementation of our phylogenetic approach to association testing. We discuss key areas for further development, and we propose potential solutions to advance the development of microbial GWAS in future work.

    Statistical perspectives on dependencies between genomic markers

    To study the genetic impact on a quantitative trait, molecular markers are used as predictor variables in a statistical model. This habilitation thesis elucidated the challenges that accompany such investigations. First, the usefulness of including different kinds of genetic effects, which can be additive or non-additive, was verified. Second, dependencies between markers caused by their proximity on the genome were studied in populations with family stratification. The resulting covariance matrix deserved special attention because it serves multiple purposes in several fields of genomic evaluation.

    Experimental Illumination of Comprehensive Fitness Landscapes: A Dissertation

    Evolution is the single cohesive logical framework in which all biological processes may exist simultaneously. Incremental changes in phenotype over imperceptibly large timescales have given rise to the enormous diversity of life we witness on earth both presently and through the natural record. The basic unit of evolution is mutation, and by perturbing biological processes, mutations may alter the fitness of an individual. However, the fitness effect of a mutation is difficult to infer from historical record, and complex to obtain experimentally in an efficient and accurate manner. We have recently developed a high throughput method to iteratively mutagenize regions of essential genes in yeast and subsequently analyze individual mutant fitness termed Exceedingly Methodical and Parallel Investigation of Randomized Individual Codons (EMPIRIC). Utilizing this technique as exemplified in Chapters II and III, it is possible to determine the fitness effects of all possible point mutations in parallel through growth competition followed by a high throughput sequencing readout. We have employed this technique to determine the distribution of fitness effects in a nine amino acid region of the Hsp90 gene of S. cerevisiae under elevated temperature, and found the bimodal distribution of fitness effects to be remarkably consistent with near-neutral theory. Comparing the measured fitness effects of mutants to the natural record, phylogenetic alignments appear to be a poor predictor of experimental fitness. In Chapter IV, to further interrogate the properties of this region, library competition under conditions of elevated temperature and salinity were performed to study the potential of protein adaptation. Strikingly, whereas both optimal and elevated temperatures produced no statistically significant beneficial mutations, under conditions of elevated salinity, adaptive mutations appear with fitness advantages up to 8% greater than wild type. 
Of particular interest, mutations conferring fitness benefits under conditions of elevated salinity almost always experience a fitness defect in other experimental conditions, indicating these mutations are environmentally specialized. Applying the experimental fitness measurements to long standing theoretical predictions of adaptation, our results are remarkably consistent with Fisher’s Geometric Model of protein evolution. Epistasis between mutations can have profound effects on evolutionary trajectories. Although the importance of epistasis has been realized since the early 1900s, the interdependence of mutations is difficult to study in vivo due to the stochastic and constant nature of background mutations. In Chapter V, utilizing the EMPIRIC methodology allows us to study the distribution of fitness effects in the context of mutant genetic backgrounds with minimal influence from unintended background mutations. By analyzing intragenic epistatic interactions, we uncovered a complex interplay between solvent shielded structural residues and solvent exposed hydrophobic surface in the amino acid 582-590 region of Hsp90. Additionally, negative epistasis appears to be negatively correlated with mutational promiscuity while additive interactions are positively correlated, indicating potential avenues for proteins to navigate fitness ‘valleys’. In summary, the work presented in this dissertation is focused on applying experimental context to the theory-rich field of evolutionary biology. The development and implementation of a novel methodology for the rapid and accurate assessment of organismal fitness has allowed us to address some of the most basic processes of evolution including adaptation and protein expression level. 
Through the work presented here and by investigators across the world, the application of experimental data to evolutionary theory has the potential to improve drug design and human health in general, as well as allow for predictive medicine in the coming era of personalized medicine