93 research outputs found

Vestige: Maximum likelihood phylogenetic footprinting

Author: Huttley Gavin A
Maxwell Peter
Wakefield Matthew J
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Phylogenetic footprinting is the identification of functional regions of DNA by their evolutionary conservation. This is achieved by comparing orthologous regions from multiple species and identifying the DNA regions that have diverged less than neutral DNA. Vestige is a phylogenetic footprinting package built on the PyEvolve toolkit that uses probabilistic molecular evolutionary modelling to represent aspects of sequence evolution, including the conventional divergence measure employed by other footprinting approaches. In addition to measuring the divergence, Vestige allows the expansion of the definition of a phylogenetic footprint to include variation in the distribution of any molecular evolutionary processes. This is achieved by displaying the distribution of model parameters that represent partitions of molecular evolutionary substitutions. Examination of the spatial incidence of these effects across regions of the genome can identify DNA segments that differ in the nature of the evolutionary process. RESULTS: Vestige was applied to a reference dataset of the SCL locus from four species and provided clear identification of the known conserved regions in this dataset. To demonstrate the flexibility to use diverse models of molecular evolution and dissect the nature of the evolutionary process, Vestige was used to footprint the Ka/Ks ratio in primate BRCA1 with a codon model of evolution. Two regions of putative adaptive evolution were identified, illustrating the ability of Vestige to represent the spatial distribution of distinct molecular evolutionary processes. CONCLUSION: Vestige provides a flexible, open platform for phylogenetic footprinting. Underpinned by the PyEvolve toolkit, Vestige provides a framework for visualising the signatures of evolutionary processes across the genome of numerous organisms simultaneously. By exploiting the maximum-likelihood statistical framework, the complex interplay between mutational processes, DNA repair and selection can be evaluated both spatially (along a sequence alignment) and temporally (for each branch of the tree), providing visual indicators to the attributes and functions of DNA sequences.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Australian National University

University of Melbourne Institutional Repository

Statistical methods for detecting periodic fragments in DNA sequence data

Author: Epps Julien
Huttley Gavin A
Ying Hua
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Period 10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes, histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed. Results: We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~21% and ~19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS). Conclusions: For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity. The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed a striking proportion of the yeast and mouse genomes are spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers. Reviewers: This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight.
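As a rough illustration of what testing for an a priori specified period can look like, the sketch below scores a sequence by the squared magnitude of an ordinary DFT coefficient evaluated at frequency 1/period on a dinucleotide indicator signal. The function name `period_power`, the toy sequence and the normalisation are assumptions made for this example; it is not the paper's IPDFT or Hybrid implementation, and it includes no blockwise bootstrap.

```python
import cmath

def period_power(seq, motifs, period):
    """Squared DFT magnitude at frequency 1/period for a motif indicator signal."""
    k = len(next(iter(motifs)))
    # Indicator signal: 1.0 where one of the motifs starts at position i, else 0.0.
    signal = [1.0 if seq[i:i + k] in motifs else 0.0
              for i in range(len(seq) - k + 1)]
    mean = sum(signal) / len(signal)
    # Single DFT coefficient at the requested integer period (mean removed).
    coeff = sum((x - mean) * cmath.exp(-2j * cmath.pi * i / period)
                for i, x in enumerate(signal))
    return abs(coeff) ** 2 / len(signal)

# The dinucleotide set named in the abstract; the sequence is a made-up toy
# in which "AA" recurs exactly every 10 bases.
motifs = {"AA", "TT", "TA"}
toy_seq = "AACGCGCGCG" * 20

power_10 = period_power(toy_seq, motifs, 10)
power_7 = period_power(toy_seq, motifs, 7)
```

On this toy input the period-10 score is far larger than the period-7 score; on real sequence a significance procedure would still be needed before drawing any conclusion.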

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Australian National University

Species abundance information improves sequence taxonomy classification accuracy.

Author: Bokulich Nicholas A
Caporaso J Gregory
Huttley Gavin A
Kaehler Benjamin D
Knight Rob
McDonald Daniel
Publication venue: eScholarship, University of California
Publication date: 01/10/2019

Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in the species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which is favourably comparable to the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.
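The effect of dropping the equal-likelihood assumption can be sketched with Bayes' rule alone. In the toy example below, the species names, likelihoods and abundance weights are all invented for illustration, and the code does not use or reproduce q2-clawback; it only shows how a non-uniform prior can change which taxon has the highest posterior.

```python
import math

def posterior(log_likelihoods, priors):
    """Combine per-taxon log-likelihoods with prior weights via Bayes' rule."""
    log_post = {t: ll + math.log(priors[t]) for t, ll in log_likelihoods.items()}
    # Normalise in log space for numerical stability.
    m = max(log_post.values())
    total = sum(math.exp(v - m) for v in log_post.values())
    return {t: math.exp(v - m) / total for t, v in log_post.items()}

# Hypothetical log-likelihoods of one read under two candidate species.
log_lik = {"species_A": math.log(0.6), "species_B": math.log(0.4)}

# Uniform prior: every reference species equally likely to be observed.
uniform = {"species_A": 0.5, "species_B": 0.5}
# Hypothetical environment-specific abundances: species_B dominates this habitat.
weighted = {"species_A": 0.1, "species_B": 0.9}

post_uniform = posterior(log_lik, uniform)
post_weighted = posterior(log_lik, weighted)
best_uniform = max(post_uniform, key=post_uniform.get)
best_weighted = max(post_weighted, key=post_weighted.get)
```

With the uniform prior the read is assigned to `species_A`; with the abundance-weighted prior the assignment switches to `species_B`.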

Repository for Publications and Research Data

eScholarship - University of California

Pitfalls of the most commonly used models of context dependent substitution

Author: Huttley Gavin A
Lindsay Helen
Yap Von Bing
Ying Hua
Publication venue: BioMed Central
Publication date: 01/01/2008

Correction to Lindsay H, Yap VB, Ying H, Huttley GA: Pitfalls of the most commonly used models of context dependent substitution. Biology Direct 2008, 3:5

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Australian National University

ScholarBank@NUS

Pathological rate matrices: from primates to pathogens

Author: Easteal Simon
Huttley Gavin A
Knight Rob
Schranz Harold W
Yap Von Bing
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Continuous-time Markov models allow flexible, parametrically succinct descriptions of sequence divergence. Non-reversible forms of these models are more biologically realistic but are challenging to develop. The instantaneous rate matrices defined for these models are typically transformed into substitution probability matrices using a matrix exponentiation algorithm that employs eigendecomposition, but this algorithm has characteristic vulnerabilities that lead to significant errors when a rate matrix possesses certain 'pathological' properties. Here we tested whether pathological rate matrices exist in nature, and considered the suitability of different algorithms to their computation. Results: We used concatenated protein coding gene alignments from microbial genomes, primate genomes and independent intron alignments from primate genomes. The Taylor series expansion and eigendecomposition matrix exponentiation algorithms were compared to the less widely employed, but more robust, Padé with scaling and squaring algorithm for nucleotide, dinucleotide, codon and trinucleotide rate matrices. Pathological dinucleotide and trinucleotide matrices were evident in the microbial data set, affecting the eigendecomposition and Taylor algorithms respectively. Even using a conservative estimate of matrix error (occurrence of an invalid probability), both Taylor and eigendecomposition algorithms exhibited substantial error rates: ~100% of all exonic trinucleotide matrices were pathological to the Taylor algorithm while ~10% of codon positions 1 and 2 dinucleotide matrices and intronic trinucleotide matrices, and ~30% of codon matrices were pathological to eigendecomposition. The majority of Taylor algorithm errors derived from occurrence of multiple unobserved states. A small number of negative probabilities were detected from the Padé algorithm on trinucleotide matrices that were attributable to machine precision. Although the Padé algorithm does not facilitate caching of intermediate results, it was up to 3× faster than eigendecomposition on the same matrices. Conclusion: Development of robust software for computing non-reversible dinucleotide, codon and higher evolutionary models requires implementation of the Padé with scaling and squaring algorithm.
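The scaling-and-squaring idea can be sketched in a few lines. Note that the version below wraps a truncated Taylor series rather than a Padé approximant, so it is a simplified stand-in and not the algorithm evaluated in the paper; the function names, the number of series terms and the two-state rate matrix are assumptions chosen so the result can be checked against a closed form.

```python
import math

def mat_mul(a, b):
    """Plain square-matrix product on lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_scaled_taylor(q, t, terms=12):
    """Approximate exp(Q*t) by scaling, a truncated Taylor series, then squaring."""
    n = len(q)
    a = [[q[i][j] * t for j in range(n)] for i in range(n)]
    norm = max(sum(abs(x) for x in row) for row in a)
    # Scale by 2**s so the scaled matrix has infinity-norm of at most about 0.5.
    s = max(0, math.ceil(math.log2(norm / 0.5))) if norm > 0 else 0
    scaled = [[x / 2 ** s for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds scaled**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    # Undo the scaling: exp(A) = (exp(A / 2**s)) ** (2**s).
    for _ in range(s):
        result = mat_mul(result, result)
    return result

# Hypothetical two-state rate matrix (rows sum to zero) and elapsed time.
Q = [[-1.0, 1.0], [2.0, -2.0]]
P = expm_scaled_taylor(Q, 3.0)
```

Each row of `P` should sum to one, and for this two-state case `P[0][0]` can be compared with the closed form 2/3 + (1/3)·exp(-9).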

ANU Digital Collections

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

The Australian National University

ScholarBank@NUS

Infection with a Virulent Strain of Wolbachia Disrupts Genome Wide-Patterns of Cytosine Methylation in the Mosquito Aedes aegypti

Author: Caragata Eric P.
Huttley Gavin A.
McGraw Elizabeth A.
O'Neill Scott L.
Popovici Jean
Rancès Edwige
Woolfit Megan
Ye Yixin H.
Publication venue: Public Library of Science (PLoS)
Publication date: 11/12/2015

BACKGROUND: Cytosine methylation is one of several reversible epigenetic modifications of DNA that allow a greater flexibility in the relationship between genotype and phenotype. Methylation in the simplest models dampens gene expression by modifying regions of DNA critical for transcription factor binding. The capacity to methylate DNA is variable in the insects due to diverse histories of gene loss and duplication of DNA methylases. Mosquitoes, like Drosophila melanogaster, possess only a single methylase, DNMT2. DESCRIPTION: Here we characterise the methylome of the mosquito Aedes aegypti and examine its relationship to transcription and test the effects of infection with a virulent strain of the endosymbiont Wolbachia on the stability of methylation patterns. CONCLUSION: We see that methylation in the A. aegypti genome is associated with reduced transcription and is most common in the promoters of genes relating to regulation of transcription and metabolism. Similar gene classes are also methylated in aphids and honeybees, suggesting either conservation or convergence of methylation patterns. In addition to this evidence of evolutionary stability, we also show that infection with the virulent wMelPop Wolbachia strain induces additional methylation and demethylation events in the genome. While most of these changes seem random with respect to gene function and have no detected effect on transcription, there does appear to be enrichment of genes associated with membrane function. Given that Wolbachia lives within a membrane-bound vacuole of host origin and retains a large number of genes for transporting host amino acids, inorganic ions and ATP despite a severely reduced genome, these changes might represent an evolved strategy for manipulating the host environments for its own gain. Testing for a direct link between these methylation changes and expression, however, will require study across a broader range of developmental stages and tissues with methods that detect splice variants. This research was supported by The National Health and Medical Research Council of Australia. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The Australian National University

Species abundance information improves sequence taxonomy classification accuracy

Author: Bokulich Nicholas A
Caporaso J Gregory
Huttley Gavin
Kaehler Benjamin
Knight Rob
McDonald Daniel
Publication venue: Springer Science and Business Media LLC
Publication date: 01/10/2019

Repository for Publications and Research Data

eScholarship - University of California

The Australian National University

Detecting coevolution without phylogenetic trees? Tree-ignorant metrics of coevolution perform as well as tree-aware metrics

Author: Caporaso J Gregory
Easton Brett C
Hunter Lawrence
Huttley Gavin A
Knight Rob
Smit Sandra
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Identifying coevolving positions in protein sequences has myriad applications, ranging from understanding and predicting the structure of single molecules to generating proteome-wide predictions of interactions. Algorithms for detecting coevolving positions can be classified into two categories: tree-aware, which incorporate knowledge of phylogeny, and tree-ignorant, which do not. Tree-ignorant methods are frequently orders of magnitude faster, but are widely held to be insufficiently accurate because of a confounding of shared ancestry with coevolution. We conjectured that by using a null distribution that appropriately controls for the shared-ancestry signal, tree-ignorant methods would exhibit equivalent statistical power to tree-aware methods. Using a novel t-test transformation of coevolution metrics, we systematically compared four tree-aware and five tree-ignorant coevolution algorithms, applying them to myoglobin and myosin. We further considered the influence of sequence recoding using reduced-state amino acid alphabets, a common tactic employed in coevolutionary analyses to improve both statistical and computational performance. Results: Consistent with our conjecture, the transformed tree-ignorant metrics (particularly Mutual Information) often outperformed the tree-aware metrics. Our examination of the effect of recoding suggested that charge-based alphabets were generally superior for identifying the stabilizing interactions in alpha helices. Performance was not always improved by recoding, however, indicating that the choice of alphabet is critical. Conclusion: The results suggest that t-test transformation of tree-ignorant metrics can be sufficient to control for patterns arising from shared ancestry.
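For readers unfamiliar with the tree-ignorant metric singled out in the abstract, the sketch below computes raw Mutual Information between two alignment columns. The function name and the four-sequence toy columns are assumptions made for this example; the t-test transformation and the null distribution described in the abstract are not implemented here.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equal-length alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (x, y), c in count_ab.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
    return mi

# Toy columns: one residue per sequence, four sequences in the alignment.
covarying = mutual_information("AAKK", "DDEE")    # residues change together
independent = mutual_information("AAKK", "DEDE")  # no association
```

Perfectly covarying columns with two equally frequent states give one bit of mutual information, while the independent pair gives zero.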

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

The Australian National University

Loss of ACTN3 gene function alters mouse muscle metabolism and shows evidence of positive selection in humans

Author: Berman Yemima
Easteal Simon
Edwards Michael R
Gunning Peter W
Hardeman Edna C
Hook Jeff W
Huttley Gavin Austin
Kee Anthony J
Lemckert Frances A
MacArthur Daniel
North Kathryn
Quinlan Kate G
Raftery Joanna M
Seto Jane T
Yang Nan
Publication venue: Springer Science and Business Media LLC
Publication date: 08/12/2015

More than a billion humans worldwide are predicted to be completely deficient in the fast skeletal muscle fiber protein α-actinin-3 owing to homozygosity for a premature stop codon polymorphism, R577X, in the ACTN3 gene. The R577X polymorphism is associ

The Australian National University

PyCogent: a toolkit for making sense from sequence

Author: Birmingham Amanda
Caporaso J Gregory
Carnes Jason
Easton Brett C
Eaton Michael
Hamady Micah
Huttley Gavin A
Knight Rob
Lindsay Helen
Liu Zongzhi
Lozupone Catherine
Maxwell Peter
McDonald Daniel
Robeson Michael
Sammut Raymond
Smit Sandra
Wakefield Matthew J
Widmann Jeremy
Wikman Shandy
Wilson Stephanie
Ying Hua
Publication venue: BioMed Central
Publication date: 01/01/2007

The COmparative GENomic Toolkit, a framework for probabilistic analyses of biological sequences, devising workflows and generating publication quality graphics, has been implemented in Python.

Crossref

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

The Australian National University

University of Melbourne Institutional Repository