93 research outputs found

    Vestige: Maximum likelihood phylogenetic footprinting

    BACKGROUND: Phylogenetic footprinting is the identification of functional regions of DNA by their evolutionary conservation. This is achieved by comparing orthologous regions from multiple species and identifying the DNA regions that have diverged less than neutral DNA. Vestige is a phylogenetic footprinting package built on the PyEvolve toolkit that uses probabilistic molecular evolutionary modelling to represent aspects of sequence evolution, including the conventional divergence measure employed by other footprinting approaches. In addition to measuring the divergence, Vestige allows the expansion of the definition of a phylogenetic footprint to include variation in the distribution of any molecular evolutionary process. This is achieved by displaying the distribution of model parameters that represent partitions of molecular evolutionary substitutions. Examination of the spatial incidence of these effects across regions of the genome can identify DNA segments that differ in the nature of the evolutionary process. RESULTS: Vestige was applied to a reference dataset of the SCL locus from four species and provided clear identification of the known conserved regions in this dataset. To demonstrate the flexibility to use diverse models of molecular evolution and dissect the nature of the evolutionary process, Vestige was used to footprint the Ka/Ks ratio in primate BRCA1 with a codon model of evolution. Two regions of putative adaptive evolution were identified, illustrating the ability of Vestige to represent the spatial distribution of distinct molecular evolutionary processes. CONCLUSION: Vestige provides a flexible, open platform for phylogenetic footprinting. Underpinned by the PyEvolve toolkit, Vestige provides a framework for visualising the signatures of evolutionary processes across the genome of numerous organisms simultaneously.
By exploiting the maximum-likelihood statistical framework, the complex interplay between mutational processes, DNA repair and selection can be evaluated both spatially (along a sequence alignment) and temporally (for each branch of the tree), providing visual indicators of the attributes and functions of DNA sequences.

    Statistical methods for detecting periodic fragments in DNA sequence data

    BACKGROUND: Period 10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes, histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed. RESULTS: We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~21% and ~19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS). CONCLUSIONS: For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity.
The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed a striking proportion of the yeast and mouse genomes are spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers. REVIEWERS: This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight.
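
    As a rough illustration of the integer-period idea in the abstract above, the sketch below scores a dinucleotide indicator signal at a set of candidate integer periods. It assumes that IPDFT means evaluating the Fourier sum at frequency 1/p for each integer period p; the function names (`dinucleotide_indicator`, `ipdft_power`, `dominant_period`) and the synthetic sequence are hypothetical, and neither the Hybrid measure nor the blockwise bootstrap is reproduced here.

```python
import cmath

def dinucleotide_indicator(seq, motifs=("AA", "TT", "TA")):
    """1.0 where one of the motifs starts at position i, otherwise 0.0."""
    return [1.0 if seq[i:i + 2] in motifs else 0.0 for i in range(len(seq) - 1)]

def ipdft_power(signal, period):
    """Power of the mean-centred signal at frequency 1/period (assumed IPDFT form)."""
    n = len(signal)
    mean = sum(signal) / n
    total = sum(
        (x - mean) * cmath.exp(-2j * cmath.pi * i / period)
        for i, x in enumerate(signal)
    )
    return abs(total) ** 2 / n

def dominant_period(signal, periods):
    """Candidate period with the largest power."""
    return max(periods, key=lambda p: ipdft_power(signal, p))

# Synthetic sequence (illustrative only): an AA dinucleotide every 10 bases.
sequence = ("AA" + "CGCGCGCG") * 20
signal = dinucleotide_indicator(sequence)
best = dominant_period(signal, range(6, 16))
```

    The candidate range is deliberately narrow: a perfectly regular signal has equal power at the harmonics of its period (here 5 and 2), so a wider scan would need a rule for breaking such ties.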

    Species abundance information improves sequence taxonomy classification accuracy.

    Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in the species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which is favourably comparable to the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.
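
    The effect described in this abstract can be seen in a toy naive Bayes calculation, sketched below under stated assumptions. The taxa, k-mers, likelihoods and abundance weights are all made-up numbers, and `classify` is a hypothetical helper, not part of q2-clawback or any QIIME 2 plugin.

```python
import math

def classify(kmers, log_likelihoods, priors):
    """Return the taxon with the highest naive Bayes log posterior."""
    scores = {
        taxon: math.log(priors[taxon]) + sum(table[k] for k in kmers)
        for taxon, table in log_likelihoods.items()
    }
    return max(scores, key=scores.get)

# Hypothetical per-taxon k-mer probabilities (illustrative numbers only).
log_likelihoods = {
    "species_A": {"ACGT": math.log(0.5), "CGTA": math.log(0.5)},
    "species_B": {"ACGT": math.log(0.4), "CGTA": math.log(0.5)},
}
query = ["ACGT", "CGTA"]

# Uniform prior: every reference species is assumed equally likely.
uniform = {"species_A": 0.5, "species_B": 0.5}
# Assumed environment-specific abundances: species_B is far more common.
weighted = {"species_A": 0.1, "species_B": 0.9}

uniform_call = classify(query, log_likelihoods, uniform)
weighted_call = classify(query, log_likelihoods, weighted)
```

    With these numbers the uniform prior favours species_A on likelihood alone, while the abundance-weighted prior switches the call to species_B.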

    Pitfalls of the most commonly used models of context dependent substitution

    Correction to Lindsay H, Yap VB, Ying H, Huttley GA: Pitfalls of the most commonly used models of context dependent substitution. Biology Direct 2008, 3:5

    Pathological rate matrices: from primates to pathogens

    BACKGROUND: Continuous-time Markov models allow flexible, parametrically succinct descriptions of sequence divergence. Non-reversible forms of these models are more biologically realistic but are challenging to develop. The instantaneous rate matrices defined for these models are typically transformed into substitution probability matrices using a matrix exponentiation algorithm that employs eigendecomposition, but this algorithm has characteristic vulnerabilities that lead to significant errors when a rate matrix possesses certain 'pathological' properties. Here we tested whether pathological rate matrices exist in nature, and consider the suitability of different algorithms to their computation. RESULTS: We used concatenated protein coding gene alignments from microbial genomes, primate genomes and independent intron alignments from primate genomes. The Taylor series expansion and eigendecomposition matrix exponentiation algorithms were compared to the less widely employed, but more robust, Padé with scaling and squaring algorithm for nucleotide, dinucleotide, codon and trinucleotide rate matrices. Pathological dinucleotide and trinucleotide matrices were evident in the microbial data set, affecting the eigendecomposition and Taylor algorithms respectively. Even using a conservative estimate of matrix error (occurrence of an invalid probability), both Taylor and eigendecomposition algorithms exhibited substantial error rates: ~100% of all exonic trinucleotide matrices were pathological to the Taylor algorithm while ~10% of codon positions 1 and 2 dinucleotide matrices and intronic trinucleotide matrices, and ~30% of codon matrices were pathological to eigendecomposition. The majority of Taylor algorithm errors derived from occurrence of multiple unobserved states. A small number of negative probabilities were detected from the Padé algorithm on trinucleotide matrices that were attributable to machine precision.
Although the Padé algorithm does not facilitate caching of intermediate results, it was up to 3× faster than eigendecomposition on the same matrices. CONCLUSION: Development of robust software for computing non-reversible dinucleotide, codon and higher evolutionary models requires implementation of the Padé with scaling and squaring algorithm.
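
    The abstract's conservative error criterion (occurrence of an invalid probability) can be illustrated with a small pure-Python sketch. Note the substitution: it compares a plain truncated Taylor series with a Taylor series wrapped in scaling and squaring, not the Padé approximant used in the study, and the 4-state rate matrix, the branch lengths and every function name are assumptions chosen for illustration.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def expm_taylor(q, t, terms=20):
    """Truncated Taylor series: sum over k < terms of (Q t)^k / k!."""
    n = len(q)
    a = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + x for r, x in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    return result

def expm_scaled(q, t, terms=20):
    """Taylor series on Q t / 2^s, followed by s repeated squarings."""
    norm = max(sum(abs(x * t) for x in row) for row in q)
    s = 0
    while norm / 2 ** s > 0.5:
        s += 1
    p = expm_taylor(q, t / 2 ** s, terms)
    for _ in range(s):
        p = mat_mul(p, p)
    return p

def is_valid_probability_matrix(p, tol=1e-9):
    """True if no entry is negative and every row sums to 1 (within tol)."""
    return all(min(row) >= -tol and abs(sum(row) - 1.0) <= tol for row in p)

# Hypothetical 4-state rate matrix of Jukes-Cantor form (rows sum to zero).
Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

# Closed-form diagonal entry for this matrix at t = 1, used as a reference.
analytic_diag = 0.25 + 0.75 * math.exp(-4.0 / 3.0)

short_branch = expm_scaled(Q, 1.0)
long_taylor = expm_taylor(Q, 40.0)   # series truncated far too early
long_scaled = expm_scaled(Q, 40.0)
```

    On the long branch the unscaled series is cut off while its terms are still growing, so it returns entries that are not probabilities, whereas the scaled version stays valid. This mirrors the kind of failure the abstract counts, without reproducing its data or its algorithms.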

    Infection with a Virulent Strain of Wolbachia Disrupts Genome Wide-Patterns of Cytosine Methylation in the Mosquito Aedes aegypti

    BACKGROUND: Cytosine methylation is one of several reversible epigenetic modifications of DNA that allow a greater flexibility in the relationship between genotype and phenotype. Methylation in the simplest models dampens gene expression by modifying regions of DNA critical for transcription factor binding. The capacity to methylate DNA is variable in the insects due to diverse histories of gene loss and duplication of DNA methylases. Mosquitoes, like Drosophila melanogaster, possess only a single methylase, DNMT2. DESCRIPTION: Here we characterise the methylome of the mosquito Aedes aegypti, examine its relationship to transcription, and test the effects of infection with a virulent strain of the endosymbiont Wolbachia on the stability of methylation patterns. CONCLUSION: We see that methylation in the A. aegypti genome is associated with reduced transcription and is most common in the promoters of genes relating to regulation of transcription and metabolism. Similar gene classes are also methylated in aphids and honeybees, suggesting either conservation or convergence of methylation patterns. In addition to this evidence of evolutionary stability, we also show that infection with the virulent wMelPop Wolbachia strain induces additional methylation and demethylation events in the genome. While most of these changes seem random with respect to gene function and have no detected effect on transcription, there does appear to be enrichment of genes associated with membrane function. Given that Wolbachia lives within a membrane-bound vacuole of host origin and retains a large number of genes for transporting host amino acids, inorganic ions and ATP despite a severely reduced genome, these changes might represent an evolved strategy for manipulating the host environments for its own gain.
Testing for a direct link between these methylation changes and expression, however, will require study across a broader range of developmental stages and tissues with methods that detect splice variants. This research was supported by The National Health and Medical Research Council of Australia. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    Species abundance information improves sequence taxonomy classification accuracy

    Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in the species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which is favourably comparable to the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments. QIIME 2 development was primarily funded by NSF Awards 1565100 to J.G.C. and 1565057 to R.K. This work was supported by an NHMRC project grant APP1085372, awarded to G.A.H., J.G.C., and R.K.

    Detecting coevolution without phylogenetic trees? Tree-ignorant metrics of coevolution perform as well as tree-aware metrics

    BACKGROUND: Identifying coevolving positions in protein sequences has myriad applications, ranging from understanding and predicting the structure of single molecules to generating proteome-wide predictions of interactions. Algorithms for detecting coevolving positions can be classified into two categories: tree-aware, which incorporate knowledge of phylogeny, and tree-ignorant, which do not. Tree-ignorant methods are frequently orders of magnitude faster, but are widely held to be insufficiently accurate because of a confounding of shared ancestry with coevolution. We conjectured that by using a null distribution that appropriately controls for the shared-ancestry signal, tree-ignorant methods would exhibit equivalent statistical power to tree-aware methods. Using a novel t-test transformation of coevolution metrics, we systematically compared four tree-aware and five tree-ignorant coevolution algorithms, applying them to myoglobin and myosin. We further considered the influence of sequence recoding using reduced-state amino acid alphabets, a common tactic employed in coevolutionary analyses to improve both statistical and computational performance. RESULTS: Consistent with our conjecture, the transformed tree-ignorant metrics (particularly Mutual Information) often outperformed the tree-aware metrics. Our examination of the effect of recoding suggested that charge-based alphabets were generally superior for identifying the stabilizing interactions in alpha helices. Performance was not always improved by recoding, however, indicating that the choice of alphabet is critical. CONCLUSION: The results suggest that t-test transformation of tree-ignorant metrics can be sufficient to control for patterns arising from shared ancestry.
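
    For readers unfamiliar with the tree-ignorant metric named in this abstract, the sketch below computes plain Mutual Information between two alignment columns. It is a minimal example on made-up columns; `mutual_information` is a hypothetical helper, and the t-test transformation and null distribution described in the abstract are not implemented.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual Information (in bits) between two equal-length alignment columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi

# Made-up columns: one residue per sequence in the alignment.
covarying = mutual_information("AAGG", "CCTT")     # states change together
independent = mutual_information("AAGG", "CTCT")   # states pair at random
```

    Because this calculation ignores the tree, two columns that changed together once on a deep branch score the same as two that changed together repeatedly, which is the shared-ancestry confounding the abstract refers to.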

    Loss of ACTN3 gene function alters mouse muscle metabolism and shows evidence of positive selection in humans

    More than a billion humans worldwide are predicted to be completely deficient in the fast skeletal muscle fiber protein α-actinin-3 owing to homozygosity for a premature stop codon polymorphism, R577X, in the ACTN3 gene. The R577X polymorphism is associ