2,168 research outputs found

tRNA functional signatures classify plastids as late-branching cyanobacteria.

Author: Amrine Katherine Ch
Ardell David H
Lawrence Travis J
Swingley Wesley D
Publication venue: eScholarship, University of California
Publication date: 01/12/2019

Background: Eukaryotes acquired the trait of oxygenic photosynthesis through endosymbiosis of the cyanobacterial progenitor of plastid organelles. Despite recent advances in the phylogenomics of Cyanobacteria, the phylogenetic root of plastids remains controversial. Although a single origin of plastids by endosymbiosis is broadly supported, recent phylogenomic studies disagree on whether plastids branch early or late within Cyanobacteria. One underlying cause may be poor fit of evolutionary models to complex phylogenomic data.
Results: Using Posterior Predictive Analysis, we show that recently applied evolutionary models poorly fit three phylogenomic datasets curated from cyanobacteria and plastid genomes because of heterogeneities in both substitution processes across sites and compositions across lineages. To circumvent these sources of bias, we developed CYANO-MLP, a machine learning algorithm that consistently and accurately phylogenetically classifies ("phyloclassifies") cyanobacterial genomes to their clade of origin based on bioinformatically predicted function-informative features in tRNA gene complements. Classification of cyanobacterial genomes with CYANO-MLP is accurate and robust to deletion of clades, unbalanced sampling, and compositional heterogeneity in input tRNA data. CYANO-MLP consistently classifies plastid genomes into a late-branching cyanobacterial sub-clade containing single-cell, starch-producing, nitrogen-fixing ecotypes, consistent with metabolic and gene transfer data.
Conclusions: Phylogenomic data of cyanobacteria and plastids exhibit both site-process heterogeneities and compositional heterogeneities across lineages. These aspects of the data require careful modeling to avoid bias in phylogenomic estimation. Furthermore, we show that amino acid recoding strategies may be insufficient to mitigate bias from compositional heterogeneities. However, the combination of our novel tRNA-specific strategy with machine learning in CYANO-MLP appears robust to these sources of bias, with high accuracy in phyloclassification of cyanobacterial genomes. CYANO-MLP consistently classifies plastids as late-branching Cyanobacteria, consistent with independent evidence from signature-based approaches and some previous phylogenetic studies.

Directory of Open Access Journals

eScholarship - University of California

Recovering complete and draft population genomes from metagenome datasets.

Author: Gilbert Jack A
Sangwan Naseer
Xia Fangfang
Publication venue: eScholarship, University of California
Publication date: 01/03/2016

Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost when applied to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., bins containing sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
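The idea that contigs from the same genome show covarying coverage across datasets can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not a method from the review: the contig names, the coverage values, the greedy grouping rule, and the 0.95 correlation threshold are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length coverage profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no covariation signal
    return cov / (sd_x * sd_y)


def bin_by_coverage(coverage, threshold=0.95):
    """Greedy grouping: a contig joins the first bin whose seed contig has a
    coverage profile correlated at or above the (hypothetical) threshold."""
    bins = []
    for name, profile in coverage.items():
        for members in bins:
            if pearson(profile, coverage[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            bins.append([name])
    return bins


# Hypothetical mean coverage of four contigs in four metagenome samples.
coverage = {
    "contig_a": [10.0, 40.0, 5.0, 22.0],
    "contig_b": [11.0, 43.0, 6.0, 20.0],
    "contig_c": [30.0, 2.0, 18.0, 3.0],
    "contig_d": [28.0, 3.0, 20.0, 2.0],
}
bins = bin_by_coverage(coverage)
print(bins)
```

Published binning tools typically combine coverage with sequence composition and use more robust clustering; this sketch only shows why additional samples add signal, since each sample contributes one more dimension to every profile.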

Woods Hole Open Access Server

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

An analysis of single amino acid repeats as use case for application specific background models

Author: C Notredame
David P Kreil
DP Depledge
DP Kreil
E Birney
E Delot
EL Sonnhammer
EM Marcotte
G Gouridis
G Nuel
G Reinert
H Gerber
H Nielsen
IB Kuznetsov
J Thompson
J Wootton
J Xie
JD Bendtsen
JM Hancock
JW Fondon
L Brown
L Zhang
M Hoebeke
M Mar Alba
M Thomas-Chollier
M Tipping
MA Huntley
O Weiss
OB Ptitsyn
P Siwach
Paweł P Łabaj
Peter Sykacek
PP Łabaj
R Lopez
R Lyne
RI Sadreyev
RS Hegde
S Caburet
S Hands
S Henikoff
S Karlin
SF Altschul
T Koestler
VJ Promponas
VR Chechetkin
VS Pande
WR Pearson
Y Kashi
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Sequence analysis aims to identify biologically relevant signals against a backdrop of functionally meaningless variation. Increasingly, it is recognized that the quality of the background model directly affects the performance of analyses. State-of-the-art approaches rely on classical sequence models that are adapted to the studied dataset. Although performing well in the analysis of globular protein domains, these models break down in regions of stronger compositional bias or low complexity. While these regions are typically filtered, there is increasing anecdotal evidence of functional roles. This motivates an exploration of more complex sequence models and application-specific approaches for the investigation of biased regions.
Results: Traditional Markov-chains and application-specific regression models are compared using the example of predicting runs of single amino acids, a particularly simple class of biased regions. Cross-fold validation experiments reveal that the alternative regression models capture the multi-variate trends well, despite their low dimensionality and in contrast even to higher-order Markov-predictors. We show how the significance of unusual observations can be computed for such empirical models. The power of a dedicated model in the detection of biologically interesting signals is then demonstrated in an analysis identifying the unexpected enrichment of contiguous leucine-repeats in signal-peptides. Considering different reference sets, we show how the question examined actually defines what constitutes the 'background'. Results can thus be highly sensitive to the choice of appropriate model training sets. Conversely, the choice of reference data determines the questions that can be investigated in an analysis.
Conclusions: Using a specific case of studying biased regions as an example, we have demonstrated that the construction of application-specific background models is both necessary and feasible in a challenging sequence analysis situation.
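To make "runs of single amino acids" and "background model" concrete, the following minimal sketch locates such runs and scores them under the simplest possible background, a zeroth-order (independent-residue) model. It is not the regression model or the Markov predictors evaluated in the paper; the function names, the example sequence, and the minimum run length are assumptions made for illustration.

```python
from itertools import groupby


def single_residue_runs(seq, min_len=3):
    """Return (residue, start, length) for each run of one repeated residue
    that is at least min_len long."""
    runs = []
    pos = 0
    for residue, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((residue, pos, length))
        pos += length
    return runs


def iid_window_probability(seq, residue, k):
    """Probability that k consecutive positions all carry `residue` under a
    zeroth-order background whose frequencies are estimated from seq itself."""
    freq = seq.count(residue) / len(seq)
    return freq ** k


# Hypothetical peptide with a leucine run and a glutamine run.
seq = "MKLLLLLAVGQQQSTLLA"
runs = single_residue_runs(seq)
for residue, start, length in runs:
    p = iid_window_probability(seq, residue, length)
    print(residue, start, length, p)
```

Estimating the frequencies from a different reference set (for example, signal peptides only versus whole proteomes) changes the resulting probabilities, which is the sensitivity to the choice of background that the abstract describes.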

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Publikationsserver der Universitätsbibliothek Bodenkultur Wien

Warwick Research Archives Portal Repository

Using a neural network to backtranslate amino acid sequences

Author: Seffens William
White Gilbert
Publication venue: Universidad Católica de Valparaíso
Publication date: 07/04/2003

A neural network (NN) was trained on amino and nucleic acid sequences to test the NN's ability to predict a nucleic acid sequence given only an amino acid sequence. A multi-layer backpropagation network of one hidden layer with 5 to 9 neurons was used. Different network configurations were used with varying numbers of input neurons to represent amino acids, while a constant representation was used for the output layer representing nucleic acids. In the best-trained network, 93% of the overall bases, 85% of the degenerate bases, and 100% of the fixed bases were correctly predicted from randomly selected test sequences. The training set was composed of 60 human sequences in a window of 10 to 25 codons at the coding sequence start site. Different NN configurations involving the encoding of amino acids under increasing window sizes were evaluated to predict the behavior of the NN with a significantly larger training set. This genetic data analysis effort will assist in understanding human gene structure. Benefits include computational tools that could more reliably predict the backtranslation of amino acid sequences, which is useful for degenerate PCR cloning, and that may assist the identification of human gene coding sequences (CDS) from open reading frames in DNA databases.
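The distinction between "fixed" and "degenerate" bases can be illustrated with the standard genetic code. The sketch below assumes one plausible reading of those terms: a codon position is fixed when every synonymous codon for the amino acid carries the same base there, and degenerate otherwise. The abstract does not define the terms, so this definition and the helper names are assumptions; no neural network is involved.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_pattern(aa):
    """For each codon position, return the base shared by all synonymous
    codons of `aa`, or None where the position is degenerate."""
    codons = [codon for codon, residue in CODON_TABLE.items() if residue == aa]
    return tuple(
        codons[0][i] if len({codon[i] for codon in codons}) == 1 else None
        for i in range(3)
    )


def fixed_base_mask(protein):
    """Backtranslate as far as the code alone allows: fixed bases are spelled
    out, degenerate positions are written as N."""
    return "".join(
        base if base else "N" for aa in protein for base in codon_pattern(aa)
    )


print(codon_pattern("M"))      # methionine has a single codon
print(codon_pattern("L"))      # leucine has six codons
print(fixed_base_mask("MAW"))
```

Under this reading, the fixed positions are recoverable from the code table alone, so the difficulty of backtranslation lies in the positions marked N, where a predictor has to rely on context such as neighboring residues or codon usage.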

Bioline International

New Methods to Study Proline-Rich Disordered Regions and Their Structural Ensembles in Protein Signaling Pathways

Author: LIU CHENGCHENG
Publication date: 24/08/2012

Ph.D (Doctor of Philosophy)

ScholarBank@NUS

Applications of Bioinformatics and Experimental Methods to Intrinsic Disorder-Based Protein-Protein Interactions

Author: Vladimir N. Uversky
William T. Jones
Xiaolin Sun
Publication venue: IntechOpen
Publication date: 01/01/2012

IntechOpen

Scholar Commons - University of South Florida