
    Identifying statistical dependence in genomic sequences via mutual information estimates

    Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, we identify correlations between different parts of the maize zmSRp32 gene, where we find significant dependencies between the 5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling. Comment: Preliminary version. Final version in EURASIP Journal on Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
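
The mutual-information idea in this abstract can be illustrated with a minimal sketch. It assumes two equal-length symbol sequences compared position by position and uses the plain plug-in (empirical frequency) estimate. The function name `mutual_information` is hypothetical, and neither the paper's estimator nor its threshold function is reproduced here.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information, in bits, between two
    equal-length symbol sequences paired position by position.

    Illustrative only: this is the textbook empirical estimate, not
    the estimator or significance threshold defined in the paper.
    """
    if len(xs) != len(ys):
        raise ValueError("sequences must have equal length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    px = Counter(xs)              # marginal counts for xs
    py = Counter(ys)              # marginal counts for ys
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))) written in terms of counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


if __name__ == "__main__":
    print(mutual_information("ACGT" * 5, "ACGT" * 5))
    print(mutual_information("AACC", "ACAC"))
```

On short sequences the plug-in estimate is biased upward, which is one reason a significance threshold of the kind the abstract mentions is needed before reading a nonzero value as a real dependency.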

    Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing

    We propose a flexible change-point model for inhomogeneous Poisson processes, which arise naturally from next-generation DNA sequencing, and derive score and generalized likelihood statistics for shifts in intensity functions. We construct a modified Bayesian information criterion (mBIC) to guide model selection, and point-wise approximate Bayesian confidence intervals for assessing the confidence in the segmentation. The model is applied to DNA copy number profiling with sequencing data and evaluated on simulated spike-in and real data sets. Comment: Published at http://dx.doi.org/10.1214/11-AOAS517 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
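
As a rough illustration of a likelihood-based change-point scan, the sketch below makes a much simpler assumption than the abstract: read counts are already binned, the rate is constant within a segment, and there is at most one change. It is not the paper's model, score statistic, or mBIC, and the function name `best_single_change_point` is hypothetical.

```python
from math import log


def _term(count, length):
    # count * log(count / length), using the convention 0 * log(0) = 0
    return count * log(count / length) if count > 0 else 0.0


def best_single_change_point(counts):
    """Scan all split positions k (1 <= k < n) of a list of binned
    Poisson counts and return (k, statistic), where the statistic is
    twice the log generalized likelihood ratio of 'two constant rates,
    changing before bin k' against 'one constant rate'.

    Simplified sketch: homogeneous rates within segments, one change.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two bins")
    total = sum(counts)
    null = _term(total, n)  # maximized log-likelihood part under one rate
    best_k, best_stat = None, float("-inf")
    left = 0
    for k in range(1, n):
        left += counts[k - 1]
        right = total - left
        stat = 2.0 * (_term(left, k) + _term(right, n - k) - null)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat


if __name__ == "__main__":
    print(best_single_change_point([2, 2, 2, 2, 10, 10, 10, 10]))
```

Deciding whether the largest statistic justifies a split, and how many splits to accept overall, is the model-selection question that a criterion such as the mBIC mentioned in the abstract is meant to answer.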

    Molecular evolution of the sheep prion protein gene

    Transmissible spongiform encephalopathies (TSEs) are infectious, fatal neurodegenerative diseases characterized by aggregates of modified forms of the prion protein (PrP) in the central nervous system. Well-known examples include variant Creutzfeldt-Jakob Disease (vCJD) in humans, BSE in cattle, chronic wasting disease in deer and scrapie in sheep and goats. In humans, sheep and deer, disease susceptibility is determined by host genotype at the prion protein gene (PRNP). Here I examine the molecular evolution of PRNP in ruminants and show that variation in sheep appears to have been maintained by balancing selection, a profoundly different process from that seen in other ruminants. Scrapie eradication programs such as those recently implemented in the UK, USA and elsewhere are based on the assumption that PRNP is under positive selection in response to scrapie. If, as these data suggest, that assumption is wrong, eradication programs will disrupt this balancing selection, and may have a negative impact on the fitness or scrapie resistance of national flocks.

    Recent and Ancient Signature of Balancing Selection around the S-Locus in Arabidopsis halleri and A. lyrata

    Balancing selection can maintain different alleles over long evolutionary times. Beyond this direct effect on the molecular targets of selection, balancing selection is also expected to increase neutral polymorphism in linked genome regions, in inverse proportion to their genetic map distances from the selected sites. The genes controlling plant self-incompatibility are subject to one of the strongest forms of balancing selection, and they show clear signatures of balancing selection. The genome region containing those genes (the S-locus) is generally described as nonrecombining, and the physical size of the region with low recombination has recently been established in a few species. However, the size of the region showing the indirect footprints of selection due to linkage to the S-locus is only roughly known. Here, we improved estimates of this region by surveying synonymous polymorphism and estimating recombination rates at 12 flanking region loci at known physical distances from the S-locus region boundary, in two closely related self-incompatible plants, Arabidopsis halleri and A. lyrata. In addition to studying more loci than previous studies and using known physical distances, we simulated an explicit demographic scenario for the divergence between the two species, to evaluate the extent of the genomic region whose diversity departs significantly from neutral expectations. At the closest flanking loci, we detected signatures of both recent and ancient indirect effects of selection on the S-locus flanking genes, finding ancestral polymorphisms shared by both species, as well as an excess of derived mutations private to either species. However, these effects are detected only in a physically small region, suggesting that recombination in the flanking regions is sufficient to quickly break up linkage disequilibrium with the S-locus. Our approach may be useful for distinguishing cases of ancient versus recently evolved balancing selection in other systems.

    Alignment-free Genomic Analysis via a Big Data Spark Platform

    Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions coming out of a recent hallmark benchmarking study. FADE's development and potential impact comprise novel aspects of interest, namely: (a) a considerable effort in distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. On this last point, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
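
To make the notion of an alignment-free function concrete, here is a minimal single-machine sketch of one of the simplest ones: the exact Jaccard distance between the k-mer sets of two sequences. MASH estimates the Jaccard index with MinHash sketches; this example computes it exactly instead. It does not use Spark or any FADE interface, and the names `kmers` and `jaccard_distance` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """Exact Jaccard distance between the k-mer sets of two sequences.

    Illustrative only: no sketching, no distribution, and not the
    implementation used by FADE or MASH.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k: treat them as identical.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


if __name__ == "__main__":
    print(jaccard_distance("ACGTACGT", "ACGTACGT"))
    print(jaccard_distance("ACGT", "CGTA", k=2))
```

Exact k-mer sets grow with genome size, which is why sketching and distributed execution of the kind the abstract describes become relevant at scale.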