154 research outputs found

    Vestige: Maximum likelihood phylogenetic footprinting

    BACKGROUND: Phylogenetic footprinting is the identification of functional regions of DNA by their evolutionary conservation. This is achieved by comparing orthologous regions from multiple species and identifying the DNA regions that have diverged less than neutral DNA. Vestige is a phylogenetic footprinting package built on the PyEvolve toolkit that uses probabilistic molecular evolutionary modelling to represent aspects of sequence evolution, including the conventional divergence measure employed by other footprinting approaches. In addition to measuring divergence, Vestige allows the definition of a phylogenetic footprint to be expanded to include variation in the distribution of any molecular evolutionary process. This is achieved by displaying the distribution of model parameters that represent partitions of molecular evolutionary substitutions. Examination of the spatial incidence of these effects across regions of the genome can identify DNA segments that differ in the nature of the evolutionary process. RESULTS: Vestige was applied to a reference dataset of the SCL locus from four species and provided clear identification of the known conserved regions in this dataset. To demonstrate the flexibility to use diverse models of molecular evolution and to dissect the nature of the evolutionary process, Vestige was used to footprint the Ka/Ks ratio in primate BRCA1 with a codon model of evolution. Two regions of putative adaptive evolution were identified, illustrating the ability of Vestige to represent the spatial distribution of distinct molecular evolutionary processes. CONCLUSION: Vestige provides a flexible, open platform for phylogenetic footprinting. Underpinned by the PyEvolve toolkit, Vestige provides a framework for visualising the signatures of evolutionary processes across the genomes of numerous organisms simultaneously. By exploiting the maximum-likelihood statistical framework, the complex interplay between mutational processes, DNA repair and selection can be evaluated both spatially (along a sequence alignment) and temporally (for each branch of the tree), providing visual indicators of the attributes and functions of DNA sequences.
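    As a rough illustration of the footprinting idea (not Vestige's likelihood-based implementation or its API), the sketch below scores a toy alignment with a sliding window of column conservation and flags windows that are more conserved than an assumed neutral baseline; the alignment, window size and neutral_identity threshold are all invented for illustration.

```python
# Minimal sliding-window footprinting sketch: score the fraction of fully
# conserved alignment columns per window and flag windows that exceed an
# assumed neutral conservation level. This is a toy proxy for the
# likelihood-based divergence estimates Vestige reports.

def window_conservation(alignment, window=20, step=5):
    """Yield (start, fraction of fully conserved columns) for each window."""
    seqs = list(alignment.values())
    length = min(len(s) for s in seqs)
    for start in range(0, length - window + 1, step):
        cols = zip(*(s[start:start + window] for s in seqs))
        conserved = sum(1 for col in cols if len(set(col)) == 1)
        yield start, conserved / window

alignment = {          # toy orthologous sequences (hypothetical)
    "human": "ACGTACGTACGTTTGACCGTAGGCTAACGTGA",
    "mouse": "ACGTACGAACGTTTGACCGTAGCCTAACGAGA",
    "dog":   "ACGTACGTACGTTTGACCGAAGGCTAACGTGA",
}
neutral_identity = 0.6  # assumed neutral conservation level, for illustration only

for start, score in window_conservation(alignment, window=16, step=4):
    flag = "footprint?" if score > neutral_identity else ""
    print(f"{start:>3} {score:.2f} {flag}")
```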

    Statistical methods for detecting periodic fragments in DNA sequence data

    BACKGROUND: Period-10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes, histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed. RESULTS: We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated, but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was only approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study, where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome-associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~21% and ~19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS). CONCLUSIONS: For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity. The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed that a striking proportion of the yeast and mouse genomes is spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers. REVIEWERS: This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight.
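    A minimal sketch of confirmatory period-10 testing, assuming a DFT power statistic and a blockwise bootstrap null; this is a generic illustration, not the IPDFT or Hybrid implementations evaluated in the paper, and the sequence, block size and replicate count below are arbitrary.

```python
import numpy as np

def dinucleotide_indicator(seq, dinucs=("AA", "TT", "TA")):
    """Binary series marking positions where a target dinucleotide starts."""
    return np.array([1.0 if seq[i:i + 2] in dinucs else 0.0
                     for i in range(len(seq) - 1)])

def power_at_period(x, period=10):
    """DFT power of a mean-centred series at the bin closest to the stated period."""
    x = x - x.mean()
    n = len(x)
    freq = np.arange(n) / n
    k = int(np.argmin(np.abs(freq - 1.0 / period)))
    return np.abs(np.fft.fft(x)[k]) ** 2 / n

def blockwise_bootstrap_pvalue(x, period=10, block=20, reps=500, seed=0):
    """P-value for the observed period power against block-resampled nulls."""
    rng = np.random.default_rng(seed)
    observed = power_at_period(x, period)
    n = len(x)
    starts = np.arange(0, n - block + 1)
    exceed = 0
    for _ in range(reps):
        picks = rng.choice(starts, size=n // block + 1)
        resampled = np.concatenate([x[s:s + block] for s in picks])[:n]
        if power_at_period(resampled, period) >= observed:
            exceed += 1
    return (exceed + 1) / (reps + 1)

rng = np.random.default_rng(1)
seq = "".join(rng.choice(list("ACGT"), size=300))   # toy sequence
x = dinucleotide_indicator(seq)
print(power_at_period(x), blockwise_bootstrap_pvalue(x))
```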

    Species abundance information improves sequence taxonomy classification accuracy

    Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which compares favourably with the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.
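    The effect of abundance information on a naive Bayes call can be seen in a toy posterior calculation in which environment-specific class priors replace uniform ones; the species names, likelihoods and abundance weights below are invented for illustration, and this is not the q2-clawback implementation.

```python
import numpy as np

# The posterior is proportional to prior x likelihood, so swapping a uniform
# prior for environment-specific abundances can change the winning species
# when likelihoods are close.

species = ["S. epidermidis", "S. hominis", "S. rare_gut_taxon"]
log_likelihood = np.log([0.012, 0.010, 0.013])       # P(read | species), toy values

uniform_prior = np.full(3, 1 / 3)
skin_abundance_prior = np.array([0.55, 0.40, 0.05])  # assumed skin-sample weights

def map_call(log_lik, prior):
    """Return the maximum a posteriori species for the given prior."""
    posterior = log_lik + np.log(prior)
    return species[int(np.argmax(posterior))]

print("uniform prior   ->", map_call(log_likelihood, uniform_prior))
print("abundance prior ->", map_call(log_likelihood, skin_abundance_prior))
```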

    Pitfalls of the most commonly used models of context dependent substitution

    Correction to Lindsay H, Yap VB, Ying H, Huttley GA: Pitfalls of the most commonly used models of context dependent substitution. Biology Direct 2008, 3:5

    Estimates of the effect of natural selection on protein-coding content

    Analysis of natural selection is key to understanding many core biological processes, including the emergence of competition, cooperation, and complexity, and has important applications in the targeted development of vaccines. Selection is hard to observe directly but can be inferred from molecular sequence variation. For protein-coding nucleotide sequences, the ratio of nonsynonymous to synonymous substitutions (ω) distinguishes neutrally evolving sequences (ω = 1) from those subjected to purifying (ω < 1) or positive (ω > 1) selection. We show that current models used to estimate ω are substantially biased by naturally occurring sequence compositions. We present a novel model that weights substitutions by conditional nucleotide frequencies and escapes these artifacts. Applying it to the genomes of pathogens causing malaria, leprosy, tuberculosis, and Lyme disease gave significant discrepancies in estimates, with ∼10-30% of genes affected. Our work has substantial implications for how vaccine targets are chosen and for studying the molecular basis of adaptive evolution.
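    For intuition about what ω summarises, the sketch below tallies synonymous versus nonsynonymous codon differences between two toy aligned coding sequences using the standard genetic code; it is a counting exercise only, not the maximum-likelihood codon model (or the conditional-frequency weighting) described above.

```python
# Build the standard genetic code, then classify each differing codon pair
# as synonymous (same amino acid) or nonsynonymous (different amino acid).

bases = "TCAG"
amino_acids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = dict(zip((a + b + c for a in bases for b in bases for c in bases),
                       amino_acids))

def count_substitutions(seq1, seq2):
    """Tally synonymous vs nonsynonymous codon differences between aligned CDSs."""
    syn = nonsyn = 0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2:
            continue
        if codon_table[c1] == codon_table[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn

# toy aligned coding sequences (hypothetical)
seq1 = "ATGTTTGGACAACGT"
seq2 = "ATGTTCGGTCATCGT"
syn, nonsyn = count_substitutions(seq1, seq2)
print(f"synonymous={syn} nonsynonymous={nonsyn}")
```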

    Pathological rate matrices: from primates to pathogens

    BACKGROUND: Continuous-time Markov models allow flexible, parametrically succinct descriptions of sequence divergence. Non-reversible forms of these models are more biologically realistic but are challenging to develop. The instantaneous rate matrices defined for these models are typically transformed into substitution probability matrices using a matrix exponentiation algorithm that employs eigendecomposition, but this algorithm has characteristic vulnerabilities that lead to significant errors when a rate matrix possesses certain 'pathological' properties. Here we tested whether pathological rate matrices exist in nature, and considered the suitability of different algorithms for their computation. RESULTS: We used concatenated protein-coding gene alignments from microbial genomes, primate genomes and independent intron alignments from primate genomes. The Taylor series expansion and eigendecomposition matrix exponentiation algorithms were compared to the less widely employed, but more robust, Padé with scaling and squaring algorithm for nucleotide, dinucleotide, codon and trinucleotide rate matrices. Pathological dinucleotide and trinucleotide matrices were evident in the microbial data set, affecting the eigendecomposition and Taylor algorithms respectively. Even using a conservative estimate of matrix error (occurrence of an invalid probability), both Taylor and eigendecomposition algorithms exhibited substantial error rates: ~100% of all exonic trinucleotide matrices were pathological to the Taylor algorithm, while ~10% of codon positions 1 and 2 dinucleotide matrices and intronic trinucleotide matrices, and ~30% of codon matrices, were pathological to eigendecomposition. The majority of Taylor algorithm errors derived from the occurrence of multiple unobserved states. A small number of negative probabilities were detected from the Padé algorithm on trinucleotide matrices that were attributable to machine precision. Although the Padé algorithm does not facilitate caching of intermediate results, it was up to 3× faster than eigendecomposition on the same matrices. CONCLUSION: Development of robust software for computing non-reversible dinucleotide, codon and higher evolutionary models requires implementation of the Padé with scaling and squaring algorithm.
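    The kind of comparison described above can be illustrated numerically: the sketch below exponentiates a toy non-reversible rate matrix by eigendecomposition and by SciPy's Padé-based expm (which uses scaling and squaring), then checks whether each result is a valid probability matrix; the rate matrix itself is invented for illustration.

```python
import numpy as np
from scipy.linalg import expm   # SciPy's expm uses Padé with scaling and squaring

def expm_eigen(Q, t=1.0):
    """exp(Qt) via eigendecomposition; fragile when Q is nearly defective."""
    vals, vecs = np.linalg.eig(Q * t)
    return np.real(vecs @ np.diag(np.exp(vals)) @ np.linalg.inv(vecs))

def is_valid_probability_matrix(P, tol=1e-10):
    """Rows sum to 1 and all entries lie in [0, 1], within tolerance."""
    return (np.all(P >= -tol) and np.all(P <= 1 + tol)
            and np.allclose(P.sum(axis=1), 1.0))

# toy non-reversible nucleotide rate matrix: off-diagonals >= 0, rows sum to zero
Q = np.array([[-1.2, 0.5, 0.4, 0.3],
              [ 0.2, -0.9, 0.3, 0.4],
              [ 0.6, 0.1, -1.0, 0.3],
              [ 0.1, 0.7, 0.2, -1.0]])

P_eigen = expm_eigen(Q, t=0.5)
P_pade = expm(Q * 0.5)
print("eigendecomposition valid:", is_valid_probability_matrix(P_eigen))
print("Padé (SciPy) valid:      ", is_valid_probability_matrix(P_pade))
print("max |difference|:        ", np.abs(P_eigen - P_pade).max())
```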

    Machine Learning Techniques for Classifying the Mutagenic Origins of Point Mutations

    There is increasing interest in developing diagnostics that discriminate individual mutagenic mechanisms in a range of applications that include identifying population-specific mutagenesis and resolving distinct mutation signatures in cancer samples. Analyses for these applications assume that mutagenic mechanisms have a distinct relationship with neighboring bases that allows them to be distinguished. Direct support for this assumption is limited to a small number of simple cases, e.g., CpG hypermutability. We have evaluated whether the mechanistic origin of a point mutation can be resolved using only sequence context for a more complicated case. We contrasted single nucleotide variants originating from the multitude of mutagenic processes that normally operate in the mouse germline with those induced by the potent mutagen N-ethyl-N-nitrosourea (ENU). The considerable overlap in the mutation spectra of these two samples makes this a challenging problem. Employing a new, robust log-linear modeling method, we demonstrate that neighboring bases contain information regarding point mutation direction that differs between the ENU-induced and spontaneous mutation variant classes. A logistic regression classifier exhibited strong performance at discriminating between the different mutation classes. Concordance between the feature set of the best classifier and information content analyses suggests our results can be generalized to other mutation classification problems. We conclude that machine learning can be used to build a practical classification tool to identify the mutation mechanism for individual genetic variants. Software implementing our approach is freely available under an open-source license.
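    A minimal sketch of this classification setup, assuming one-hot encoded flanking bases and a scikit-learn logistic regression; the simulated flanks and class labels below are invented stand-ins for the spontaneous and ENU variant data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
bases = np.array(list("ACGT"))

def simulate_flanks(n, bias_base, bias=0.4):
    """Random 5-mer flanks with one class mildly enriched for bias_base at -1."""
    flanks = rng.choice(bases, size=(n, 5))
    enriched = rng.random(n) < bias
    flanks[enriched, 1] = bias_base   # position immediately 5' of the central site
    return flanks

# two toy classes with slightly different neighbouring-base preferences
X_raw = np.vstack([simulate_flanks(500, "C"), simulate_flanks(500, "T")])
y = np.array([0] * 500 + [1] * 500)   # 0 = "spontaneous", 1 = "ENU" (toy labels)

X = OneHotEncoder().fit_transform(X_raw)  # one-hot encode each flank position

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```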

    Infection with a Virulent Strain of Wolbachia Disrupts Genome-Wide Patterns of Cytosine Methylation in the Mosquito Aedes aegypti

    BACKGROUND: Cytosine methylation is one of several reversible epigenetic modifications of DNA that allow a greater flexibility in the relationship between genotype and phenotype. Methylation in the simplest models dampens gene expression by modifying regions of DNA critical for transcription factor binding. The capacity to methylate DNA is variable in the insects due to diverse histories of gene loss and duplication of DNA methylases. Mosquitoes, like Drosophila melanogaster, possess only a single methylase, DNMT2. DESCRIPTION: Here we characterise the methylome of the mosquito Aedes aegypti, examine its relationship to transcription and test the effects of infection with a virulent strain of the endosymbiont Wolbachia on the stability of methylation patterns. CONCLUSION: We see that methylation in the A. aegypti genome is associated with reduced transcription and is most common in the promoters of genes relating to regulation of transcription and metabolism. Similar gene classes are also methylated in aphids and honeybees, suggesting either conservation or convergence of methylation patterns. In addition to this evidence of evolutionary stability, we also show that infection with the virulent wMelPop Wolbachia strain induces additional methylation and demethylation events in the genome. While most of these changes seem random with respect to gene function and have no detected effect on transcription, there does appear to be an enrichment of genes associated with membrane function. Given that Wolbachia lives within a membrane-bound vacuole of host origin and, despite a severely reduced genome, retains a large number of genes for transporting host amino acids, inorganic ions and ATP, these changes might represent an evolved strategy for manipulating the host environment for its own gain. Testing for a direct link between these methylation changes and expression, however, will require study across a broader range of developmental stages and tissues with methods that detect splice variants. This research was supported by The National Health and Medical Research Council of Australia. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
