154 research outputs found

    Vestige: Maximum likelihood phylogenetic footprinting

    BACKGROUND: Phylogenetic footprinting is the identification of functional regions of DNA by their evolutionary conservation. This is achieved by comparing orthologous regions from multiple species and identifying the DNA regions that have diverged less than neutral DNA. Vestige is a phylogenetic footprinting package built on the PyEvolve toolkit that uses probabilistic molecular evolutionary modelling to represent aspects of sequence evolution, including the conventional divergence measure employed by other footprinting approaches. In addition to measuring divergence, Vestige allows the definition of a phylogenetic footprint to be expanded to include variation in the distribution of any molecular evolutionary process. This is achieved by displaying the distribution of model parameters that represent partitions of molecular evolutionary substitutions. Examination of the spatial incidence of these effects across regions of the genome can identify DNA segments that differ in the nature of the evolutionary process. RESULTS: Vestige was applied to a reference dataset of the SCL locus from four species and provided clear identification of the known conserved regions in this dataset. To demonstrate the flexibility to use diverse models of molecular evolution and to dissect the nature of the evolutionary process, Vestige was used to footprint the Ka/Ks ratio in primate BRCA1 with a codon model of evolution. Two regions of putative adaptive evolution were identified, illustrating the ability of Vestige to represent the spatial distribution of distinct molecular evolutionary processes. CONCLUSION: Vestige provides a flexible, open platform for phylogenetic footprinting. Underpinned by the PyEvolve toolkit, Vestige provides a framework for visualising the signatures of evolutionary processes across the genomes of numerous organisms simultaneously. By exploiting the maximum-likelihood statistical framework, the complex interplay between mutational processes, DNA repair and selection can be evaluated both spatially (along a sequence alignment) and temporally (for each branch of the tree), providing visual indicators of the attributes and functions of DNA sequences.
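    As a rough illustration of the footprinting idea (not Vestige's likelihood-based implementation or its API), the sketch below scores a toy alignment with a sliding window of column conservation and flags windows that are more conserved than an assumed neutral baseline; the alignment, window size and neutral_identity threshold are all invented for illustration.

```python
# Minimal sliding-window footprinting sketch: score the fraction of fully
# conserved alignment columns per window and flag windows that exceed an
# assumed neutral conservation level. This is a toy proxy for the
# likelihood-based divergence estimates Vestige reports.

def window_conservation(alignment, window=20, step=5):
    """Yield (start, fraction of fully conserved columns) for each window."""
    seqs = list(alignment.values())
    length = min(len(s) for s in seqs)
    for start in range(0, length - window + 1, step):
        cols = zip(*(s[start:start + window] for s in seqs))
        conserved = sum(1 for col in cols if len(set(col)) == 1)
        yield start, conserved / window

alignment = {          # toy orthologous sequences (hypothetical)
    "human": "ACGTACGTACGTTTGACCGTAGGCTAACGTGA",
    "mouse": "ACGTACGAACGTTTGACCGTAGCCTAACGAGA",
    "dog":   "ACGTACGTACGTTTGACCGAAGGCTAACGTGA",
}
neutral_identity = 0.6  # assumed neutral conservation level, for illustration only

for start, score in window_conservation(alignment, window=16, step=4):
    flag = "footprint?" if score > neutral_identity else ""
    print(f"{start:>3} {score:.2f} {flag}")
```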

    Statistical methods for detecting periodic fragments in DNA sequence data

    BACKGROUND: Period-10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes, histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed. RESULTS: We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated, but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was only approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study, where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome-associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~21% and ~19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS). CONCLUSIONS: For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity. The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed that a striking proportion of the yeast and mouse genomes is spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers. REVIEWERS: This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight.
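    A minimal sketch of confirmatory period-10 testing, assuming a DFT power statistic and a blockwise bootstrap null; this is a generic illustration, not the IPDFT or Hybrid implementations evaluated in the paper, and the sequence, block size and replicate count below are arbitrary.

```python
import numpy as np

def dinucleotide_indicator(seq, dinucs=("AA", "TT", "TA")):
    """Binary series marking positions where a target dinucleotide starts."""
    return np.array([1.0 if seq[i:i + 2] in dinucs else 0.0
                     for i in range(len(seq) - 1)])

def power_at_period(x, period=10):
    """DFT power of a mean-centred series at the bin closest to the stated period."""
    x = x - x.mean()
    n = len(x)
    freq = np.arange(n) / n
    k = int(np.argmin(np.abs(freq - 1.0 / period)))
    return np.abs(np.fft.fft(x)[k]) ** 2 / n

def blockwise_bootstrap_pvalue(x, period=10, block=20, reps=500, seed=0):
    """P-value for the observed period power against block-resampled nulls."""
    rng = np.random.default_rng(seed)
    observed = power_at_period(x, period)
    n = len(x)
    starts = np.arange(0, n - block + 1)
    exceed = 0
    for _ in range(reps):
        picks = rng.choice(starts, size=n // block + 1)
        resampled = np.concatenate([x[s:s + block] for s in picks])[:n]
        if power_at_period(resampled, period) >= observed:
            exceed += 1
    return (exceed + 1) / (reps + 1)

rng = np.random.default_rng(1)
seq = "".join(rng.choice(list("ACGT"), size=300))   # toy sequence
x = dinucleotide_indicator(seq)
print(power_at_period(x), blockwise_bootstrap_pvalue(x))
```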

    Species abundance information improves sequence taxonomy classification accuracy

    Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which compares favourably with the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.
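    The effect of abundance information on a naive Bayes call can be seen in a toy posterior calculation in which environment-specific class priors replace uniform ones; the species names, likelihoods and abundance weights below are invented for illustration, and this is not the q2-clawback implementation.

```python
import numpy as np

# The posterior is proportional to prior x likelihood, so swapping a uniform
# prior for environment-specific abundances can change the winning species
# when likelihoods are close.

species = ["S. epidermidis", "S. hominis", "S. rare_gut_taxon"]
log_likelihood = np.log([0.012, 0.010, 0.013])       # P(read | species), toy values

uniform_prior = np.full(3, 1 / 3)
skin_abundance_prior = np.array([0.55, 0.40, 0.05])  # assumed skin-sample weights

def map_call(log_lik, prior):
    """Return the maximum a posteriori species for the given prior."""
    posterior = log_lik + np.log(prior)
    return species[int(np.argmax(posterior))]

print("uniform prior   ->", map_call(log_likelihood, uniform_prior))
print("abundance prior ->", map_call(log_likelihood, skin_abundance_prior))
```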

    Pitfalls of the most commonly used models of context dependent substitution

    Correction to Lindsay H, Yap VB, Ying H, Huttley GA: Pitfalls of the most commonly used models of context dependent substitution. Biology Direct 2008, 3:5

    Estimates of the effect of natural selection on protein-coding content

    Analysis of natural selection is key to understanding many core biological processes, including the emergence of competition, cooperation, and complexity, and has important applications in the targeted development of vaccines. Selection is hard to observe directly but can be inferred from molecular sequence variation. For protein-coding nucleotide sequences, the ratio of nonsynonymous to synonymous substitutions (ω) distinguishes neutrally evolving sequences (ω = 1) from those subjected to purifying (ω < 1) or positive (ω > 1) selection. We show that current models used to estimate ω are substantially biased by naturally occurring sequence compositions. We present a novel model that weights substitutions by conditional nucleotide frequencies and escapes these artifacts. Applying it to the genomes of pathogens causing malaria, leprosy, tuberculosis, and Lyme disease gave significant discrepancies in estimates, with ∼10-30% of genes affected. Our work has substantial implications for how vaccine targets are chosen and for studying the molecular basis of adaptive evolution.
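    For intuition about what ω summarises, the sketch below tallies synonymous versus nonsynonymous codon differences between two toy aligned coding sequences using the standard genetic code; it is a counting exercise only, not the maximum-likelihood codon model (or the conditional-frequency weighting) described above.

```python
# Build the standard genetic code, then classify each differing codon pair
# as synonymous (same amino acid) or nonsynonymous (different amino acid).

bases = "TCAG"
amino_acids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = dict(zip((a + b + c for a in bases for b in bases for c in bases),
                       amino_acids))

def count_substitutions(seq1, seq2):
    """Tally synonymous vs nonsynonymous codon differences between aligned CDSs."""
    syn = nonsyn = 0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2:
            continue
        if codon_table[c1] == codon_table[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn

# toy aligned coding sequences (hypothetical)
seq1 = "ATGTTTGGACAACGT"
seq2 = "ATGTTCGGTCATCGT"
syn, nonsyn = count_substitutions(seq1, seq2)
print(f"synonymous={syn} nonsynonymous={nonsyn}")
```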

    Pathological rate matrices: from primates to pathogens

    BACKGROUND: Continuous-time Markov models allow flexible, parametrically succinct descriptions of sequence divergence. Non-reversible forms of these models are more biologically realistic but are challenging to develop. The instantaneous rate matrices defined for these models are typically transformed into substitution probability matrices using a matrix exponentiation algorithm that employs eigendecomposition, but this algorithm has characteristic vulnerabilities that lead to significant errors when a rate matrix possesses certain 'pathological' properties. Here we tested whether pathological rate matrices exist in nature, and considered the suitability of different algorithms for their computation. RESULTS: We used concatenated protein-coding gene alignments from microbial genomes, primate genomes and independent intron alignments from primate genomes. The Taylor series expansion and eigendecomposition matrix exponentiation algorithms were compared to the less widely employed, but more robust, Padé with scaling and squaring algorithm for nucleotide, dinucleotide, codon and trinucleotide rate matrices. Pathological dinucleotide and trinucleotide matrices were evident in the microbial data set, affecting the eigendecomposition and Taylor algorithms respectively. Even using a conservative estimate of matrix error (occurrence of an invalid probability), both Taylor and eigendecomposition algorithms exhibited substantial error rates: ~100% of all exonic trinucleotide matrices were pathological to the Taylor algorithm, while ~10% of codon positions 1 and 2 dinucleotide matrices and intronic trinucleotide matrices, and ~30% of codon matrices, were pathological to eigendecomposition. The majority of Taylor algorithm errors derived from the occurrence of multiple unobserved states. A small number of negative probabilities were detected from the Padé algorithm on trinucleotide matrices that were attributable to machine precision. Although the Padé algorithm does not facilitate caching of intermediate results, it was up to 3× faster than eigendecomposition on the same matrices. CONCLUSION: Development of robust software for computing non-reversible dinucleotide, codon and higher evolutionary models requires implementation of the Padé with scaling and squaring algorithm.
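    The kind of comparison described above can be illustrated numerically: the sketch below exponentiates a toy non-reversible rate matrix by eigendecomposition and by SciPy's Padé-based expm (which uses scaling and squaring), then checks whether each result is a valid probability matrix; the rate matrix itself is invented for illustration.

```python
import numpy as np
from scipy.linalg import expm   # SciPy's expm uses Padé with scaling and squaring

def expm_eigen(Q, t=1.0):
    """exp(Qt) via eigendecomposition; fragile when Q is nearly defective."""
    vals, vecs = np.linalg.eig(Q * t)
    return np.real(vecs @ np.diag(np.exp(vals)) @ np.linalg.inv(vecs))

def is_valid_probability_matrix(P, tol=1e-10):
    """Rows sum to 1 and all entries lie in [0, 1], within tolerance."""
    return (np.all(P >= -tol) and np.all(P <= 1 + tol)
            and np.allclose(P.sum(axis=1), 1.0))

# toy non-reversible nucleotide rate matrix: off-diagonals >= 0, rows sum to zero
Q = np.array([[-1.2, 0.5, 0.4, 0.3],
              [ 0.2, -0.9, 0.3, 0.4],
              [ 0.6, 0.1, -1.0, 0.3],
              [ 0.1, 0.7, 0.2, -1.0]])

P_eigen = expm_eigen(Q, t=0.5)
P_pade = expm(Q * 0.5)
print("eigendecomposition valid:", is_valid_probability_matrix(P_eigen))
print("Padé (SciPy) valid:      ", is_valid_probability_matrix(P_pade))
print("max |difference|:        ", np.abs(P_eigen - P_pade).max())
```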

    Machine Learning Techniques for Classifying the Mutagenic Origins of Point Mutations

    There is increasing interest in developing diagnostics that discriminate individual mutagenic mechanisms in a range of applications that include identifying population-specific mutagenesis and resolving distinct mutation signatures in cancer samples. Analyses for these applications assume that mutagenic mechanisms have a distinct relationship with neighboring bases that allows them to be distinguished. Direct support for this assumption is limited to a small number of simple cases, e.g., CpG hypermutability. We have evaluated whether the mechanistic origin of a point mutation can be resolved using only sequence context for a more complicated case. We contrasted single nucleotide variants originating from the multitude of mutagenic processes that normally operate in the mouse germline with those induced by the potent mutagen N-ethyl-N-nitrosourea (ENU). The considerable overlap in the mutation spectra of these two samples makes this a challenging problem. Employing a new, robust log-linear modeling method, we demonstrate that neighboring bases contain information regarding point mutation direction that differs between the ENU-induced and spontaneous mutation variant classes. A logistic regression classifier exhibited strong performance at discriminating between the different mutation classes. Concordance between the feature set of the best classifier and information content analyses suggests our results can be generalized to other mutation classification problems. We conclude that machine learning can be used to build a practical classification tool to identify the mutation mechanism for individual genetic variants. Software implementing our approach is freely available under an open-source license.
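    A minimal sketch of this classification setup, assuming one-hot encoded flanking bases and a scikit-learn logistic regression; the simulated flanks and class labels below are invented stand-ins for the spontaneous and ENU variant data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
bases = np.array(list("ACGT"))

def simulate_flanks(n, bias_base, bias=0.4):
    """Random 5-mer flanks with one class mildly enriched for bias_base at -1."""
    flanks = rng.choice(bases, size=(n, 5))
    enriched = rng.random(n) < bias
    flanks[enriched, 1] = bias_base   # position immediately 5' of the central site
    return flanks

# two toy classes with slightly different neighbouring-base preferences
X_raw = np.vstack([simulate_flanks(500, "C"), simulate_flanks(500, "T")])
y = np.array([0] * 500 + [1] * 500)   # 0 = "spontaneous", 1 = "ENU" (toy labels)

X = OneHotEncoder().fit_transform(X_raw)  # one-hot encode each flank position

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```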

    Infection with a Virulent Strain of Wolbachia Disrupts Genome-Wide Patterns of Cytosine Methylation in the Mosquito Aedes aegypti

    BACKGROUND: Cytosine methylation is one of several reversible epigenetic modifications of DNA that allow a greater flexibility in the relationship between genotype and phenotype. Methylation in the simplest models dampens gene expression by modifying regions of DNA critical for transcription factor binding. The capacity to methylate DNA is variable in the insects due to diverse histories of gene loss and duplication of DNA methylases. Mosquitoes, like Drosophila melanogaster, possess only a single methylase, DNMT2. DESCRIPTION: Here we characterise the methylome of the mosquito Aedes aegypti, examine its relationship to transcription and test the effects of infection with a virulent strain of the endosymbiont Wolbachia on the stability of methylation patterns. CONCLUSION: We see that methylation in the A. aegypti genome is associated with reduced transcription and is most common in the promoters of genes relating to regulation of transcription and metabolism. Similar gene classes are also methylated in aphids and honeybees, suggesting either conservation or convergence of methylation patterns. In addition to this evidence of evolutionary stability, we also show that infection with the virulent wMelPop Wolbachia strain induces additional methylation and demethylation events in the genome. While most of these changes seem random with respect to gene function and have no detected effect on transcription, there does appear to be an enrichment of genes associated with membrane function. Given that Wolbachia lives within a membrane-bound vacuole of host origin and, despite a severely reduced genome, retains a large number of genes for transporting host amino acids, inorganic ions and ATP, these changes might represent an evolved strategy for manipulating the host environment for its own gain. Testing for a direct link between these methylation changes and expression, however, will require study across a broader range of developmental stages and tissues with methods that detect splice variants. This research was supported by The National Health and Medical Research Council of Australia. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
