121 research outputs found

    Bioinformatics challenges for genome-wide association studies

    Motivation: The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time, thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype–phenotype relationship that is characterized by significant heterogeneity and gene–gene and gene–environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods.

    FAM-MDR: A Flexible Family-Based Multifactor Dimensionality Reduction Technique to Detect Epistasis Using Related Individuals

    We propose a novel multifactor dimensionality reduction method for epistasis detection in small or extended pedigrees, FAM-MDR. It combines features of the Genome-wide Rapid Association using Mixed Model And Regression approach (GRAMMAR) with Model-Based MDR (MB-MDR). We focus on continuous traits, although the method is general and can be used for outcomes of any type, including binary and censored traits. When comparing FAM-MDR with Pedigree-based Generalized MDR (PGMDR), which is a generalization of Multifactor Dimensionality Reduction (MDR) to continuous traits and related individuals, FAM-MDR was found to outperform PGMDR in terms of power in most of the considered simulated scenarios. Additional simulations revealed that PGMDR does not appropriately deal with multiple testing and consequently gives rise to overly optimistic results. FAM-MDR adequately deals with multiple testing in epistasis screens and is, in contrast, rather conservative by construction. Furthermore, simulations show that correcting for lower order (main) effects is of utmost importance when claiming epistasis. As Type 2 Diabetes Mellitus (T2DM) is a complex phenotype likely influenced by gene-gene interactions, we applied FAM-MDR to examine data on glucose area-under-the-curve (GAUC), an endophenotype of T2DM for which multiple independent genetic associations have been observed, in the Amish Family Diabetes Study (AFDS). This application reveals that FAM-MDR makes more efficient use of the available data than PGMDR and can deal with multi-generational pedigrees more easily. In conclusion, we have validated FAM-MDR and compared it to PGMDR, the current state-of-the-art MDR method for family data, using both simulations and a practical dataset. FAM-MDR is found to outperform PGMDR in that it handles the multiple testing issue more correctly, has increased power, and efficiently uses all available information.

    Dissecting Trait Heterogeneity: a Comparison of Three Clustering Methods Applied to Genotypic Data

    Background: Trait heterogeneity, which exists when a trait has been defined with insufficient specificity such that it is actually two or more distinct traits, has been implicated as a confounding factor in traditional statistical genetics of complex human disease. In the absence of detailed phenotypic data collected consistently in combination with genetic data, unsupervised computational methodologies offer the potential for discovering underlying trait heterogeneity. The performance of three such methods – Bayesian Classification, Hypergraph-Based Clustering, and Fuzzy k-Modes Clustering – appropriate for categorical data was compared. Also tested was the ability of these methods to detect trait heterogeneity in the presence of locus heterogeneity and/or gene-gene interaction, which are two other complicating factors in discovering genetic models of complex human disease. To determine the efficacy of applying the Bayesian Classification method to real data, the reliability of its internal clustering metrics at finding good clusterings was evaluated using permutation testing. Results: Bayesian Classification outperformed the other two methods, with the exception that the Fuzzy k-Modes Clustering performed best on the most complex genetic model. Bayesian Classification achieved excellent recovery for 75% of the datasets simulated under the simplest genetic model, while it achieved moderate recovery for 56% of datasets with a sample size of 500 or more (across all simulated models) and for 86% of datasets with 10 or fewer nonfunctional loci (across all simulated models). Neither Hypergraph Clustering nor Fuzzy k-Modes Clustering achieved good or excellent cluster recovery for a majority of datasets even under a restricted set of conditions. When using the average log of class strength as the internal clustering metric, the false positive rate was controlled very well, at three percent or less for all three significance levels (0.01, 0.05, 0.10), and the false negative rate was acceptably low (18 percent) for the least stringent significance level of 0.10. Conclusion: Bayesian Classification shows promise as an unsupervised computational method for dissecting trait heterogeneity in genotypic data. Its control of false positive and false negative rates lends confidence to the validity of its results. Further investigation of how different parameter settings may improve the performance of Bayesian Classification, especially under more complex genetic models, is ongoing.

    Nonparametric inference for classification and association with high dimensional genetic data


    Information Theory in Computational Biology: Where We Stand Today

    "A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have played a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information theory based concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.