Predicting the outer membrane proteome of Pasteurella multocida based on consensus prediction enhanced by results integration and manual confirmation
Background
Outer membrane proteins (OMPs) of Pasteurella multocida have various functions related to virulence and pathogenesis and represent important targets for vaccine development. Various bioinformatic algorithms can predict outer membrane localization and discriminate OMPs by structure or function. Designing a confident prediction framework that integrates different predictors, followed by consensus prediction, results integration and manual confirmation, will improve the prediction of the outer membrane proteome.
Results
In the present study, we used 10 different predictors classified into three groups (subcellular localization, transmembrane β-barrel protein and lipoprotein predictors) to identify putative OMPs from two available P. multocida genomes: those of avian strain Pm70 and porcine non-toxigenic strain 3480. Predicted proteins in each group were filtered by optimized criteria for consensus prediction: at least two positive predictions for the subcellular localization predictors, three for the transmembrane β-barrel protein predictors and one for the lipoprotein predictors. The consensus predicted proteins were integrated from each group into a single list of proteins. We further incorporated a manual confirmation step including a public database search against PubMed and sequence analyses, e.g. sequence and structural homology, conserved motifs/domains, functional prediction, and protein-protein interactions to enhance the confidence of prediction. As a result, we were able to confidently predict 98 putative OMPs from the avian strain genome and 107 OMPs from the porcine strain genome with 83% overlap between the two genomes.
Conclusions
The bioinformatic framework developed in this study has increased the number of putative OMPs identified in P. multocida and allowed these OMPs to be identified with a higher degree of confidence. Our approach can be applied to investigate the outer membrane proteomes of other Gram-negative bacteria.
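The consensus step described in the Results above can be sketched as a per-group vote count. This is a minimal illustration, not the authors' pipeline: the thresholds (two, three and one positive predictions) come from the abstract, while the predictor names, protein IDs, and the choice to integrate groups by taking their union are assumptions made for the example.

```python
from typing import Dict, Set

# Minimum number of positive calls required within each predictor group,
# as stated in the abstract.
THRESHOLDS = {"localization": 2, "beta_barrel": 3, "lipoprotein": 1}


def consensus_omps(calls: Dict[str, Dict[str, Set[str]]]) -> Set[str]:
    """Return proteins that reach the consensus threshold in at least one group.

    calls maps group name -> predictor name -> set of protein IDs called positive.
    Integrating groups by set union is an assumption of this sketch.
    """
    selected: Set[str] = set()
    for group, predictors in calls.items():
        votes: Dict[str, int] = {}
        for positives in predictors.values():
            for protein in positives:
                votes[protein] = votes.get(protein, 0) + 1
        selected |= {p for p, n in votes.items() if n >= THRESHOLDS[group]}
    return selected


# Hypothetical predictor names and protein IDs, for illustration only.
example_calls = {
    "localization": {"loc_a": {"P1", "P2"}, "loc_b": {"P1"}, "loc_c": {"P3"}},
    "beta_barrel": {"bb_a": {"P2", "P4"}, "bb_b": {"P4"}, "bb_c": {"P4"}},
    "lipoprotein": {"lipo_a": {"P5"}},
}
print(sorted(consensus_omps(example_calls)))  # ['P1', 'P4', 'P5']
```

The manual confirmation step (literature search, homology and domain analysis) is not modelled here; it would act as a further filter on the returned set.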
Genomic prediction and quantitative trait locus discovery in a cassava training population constructed from multiple breeding stages
Open Access Article; Published online: 11 Dec 2019
Assembly of a training population (TP) is an important component of effective genomic selection‐based breeding programs. In this study, we examined the power of diverse germplasm assembled from two cassava (Manihot esculenta Crantz) breeding programs in Tanzania at different breeding stages to predict traits and discover quantitative trait loci (QTL). This is the first genomic selection and genome‐wide association study (GWAS) on Tanzanian cassava data. We detected QTL associated with cassava mosaic disease (CMD) resistance on chromosomes 12 and 16; QTL conferring resistance to cassava brown streak disease (CBSD) on chromosomes 9 and 11; and QTL on chromosomes 2, 3, 8, and 10 associated with resistance to CBSD for root necrosis. We detected a QTL on chromosome 4 and two QTL on chromosome 12 conferring dual resistance to CMD and CBSD. The use of clones in the same stage to construct TPs provided higher trait prediction accuracy than TPs with a mixture of clones from multiple breeding stages. Moreover, clones in the early breeding stage provided more reliable trait prediction accuracy and are better candidates for constructing a TP. Although larger TP sizes have been associated with improved accuracy, in this study, adding clones from Kibaha to those from Ukiriguru and vice versa did not improve the prediction accuracy of either population. Including the Ugandan TP in either population did not improve trait prediction accuracy. This study applied genomic prediction to understand the implications of constructing TP from clones at different breeding stages pooled from different locations on trait accuracy.
Progressive Mauve: Multiple alignment of genomes with gene flux and rearrangement
Multiple genome alignment remains a challenging problem. Effects of
recombination including rearrangement, segmental duplication, gain, and loss
can create a mosaic pattern of homology even among closely related organisms.
We describe a method to align two or more genomes that have undergone
large-scale recombination, particularly genomes that have undergone substantial
amounts of gene gain and loss (gene flux). The method utilizes a novel
alignment objective score, referred to as a sum-of-pairs breakpoint score. We
also apply a probabilistic alignment filtering method to remove erroneous
alignments of unrelated sequences, which are commonly observed in other genome
alignment methods. We describe new metrics for quantifying genome alignment
accuracy which measure the quality of rearrangement breakpoint predictions and
indel predictions. The progressive genome alignment algorithm demonstrates
markedly improved accuracy over previous approaches in situations where genomes
have undergone realistic amounts of genome rearrangement, gene gain, loss, and
duplication. We apply the progressive genome alignment algorithm to a set of 23
completely sequenced genomes from the genera Escherichia, Shigella, and
Salmonella. The 23 enterobacteria have an estimated 2.46Mbp of genomic content
conserved among all taxa and total unique content of 15.2Mbp. We document
substantial population-level variability among these organisms driven by
homologous recombination, gene gain, and gene loss. Free, open-source software
implementing the described genome alignment approach is available from
http://gel.ahabs.wisc.edu/mauve .
Comment: Revision dated June 19, 200
EGPred: prediction of eukaryotic genes using Ab initio methods after combining with sequence similarity approaches
EGPred is a Web-based server that combines ab initio methods and similarity searches to predict genes,
particularly exon regions, with high accuracy. The EGPred program proceeds in the following steps: (1) an initial
BLASTX search of genomic sequence against the RefSeq database is used to identify protein hits with an E-value <1;
(2) a second BLASTX search of genomic sequence against the hits from the previous run with relaxed parameters (E-values
<10) helps to retrieve all probable coding exon regions; (3) a BLASTN search of genomic sequence against the intron
database is then used to detect probable intron regions; (4) the probable intron and exon regions are compared to
filter/remove wrong exons; (5) the NNSPLICE program is then used to reassign splicing signal site positions in the
remaining probable coding exons; and (6) finally ab initio predictions are combined with exons derived from the fifth
step based on the relative strength of start/stop and splice signal sites as obtained from ab initio and similarity
search. The combination method increases the exon level performance of five different ab initio programs by 4%-10% when
evaluated on the HMR195 data set. Similar improvement is observed when ab initio programs are evaluated on the
Burset/Guigo data set. Finally, EGPred is demonstrated on an ~95-Mbp fragment of human chromosome 13. The list of
predicted genes from this analysis is available in the supplementary material. The EGPred program is computationally
intensive due to multiple BLAST runs during each analysis. The EGPred server is available at
http://www.imtech.res.in/raghava/egpred/
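Step (4) of the EGPred procedure above, comparing probable intron and exon regions to remove wrong exons, can be illustrated with a simple interval filter. The abstract does not state the criterion EGPred uses, so the rule below (drop an exon when more than a chosen fraction of it is covered by probable introns) and the `max_fraction` parameter are assumptions made purely for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates


def overlap(a: Interval, b: Interval) -> int:
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def filter_exons(exons: List[Interval], introns: List[Interval],
                 max_fraction: float = 0.5) -> List[Interval]:
    """Keep exons whose intron-covered fraction is at most max_fraction.

    Assumes the intron intervals do not overlap each other; otherwise
    shared positions would be counted more than once.
    """
    kept = []
    for exon in exons:
        length = exon[1] - exon[0]
        covered = sum(overlap(exon, intron) for intron in introns)
        if length > 0 and covered / length <= max_fraction:
            kept.append(exon)
    return kept


# Hypothetical coordinates, for illustration only.
exons = [(100, 200), (300, 400), (500, 560)]
introns = [(190, 310), (480, 600)]
print(filter_exons(exons, introns))  # [(100, 200), (300, 400)]
```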
Inferring Energy Bounds via Static Program Analysis and Evolutionary Modeling of Basic Blocks
The ever increasing number and complexity of energy-bound devices (such as
the ones used in Internet of Things applications, smart phones, and mission
critical systems) pose an important challenge on techniques to optimize their
energy consumption and to verify that they will perform their function within
the available energy budget. In this work we address this challenge from the
software point of view and propose a novel parametric approach to estimating
tight bounds on the energy consumed by program executions that are practical
for their application to energy verification and optimization. Our approach
divides a program into basic (branchless) blocks and estimates the maximal and
minimal energy consumption for each block using an evolutionary algorithm. Then
it combines the obtained values according to the program control flow, using
static analysis, to infer functions that give both upper and lower bounds on
the energy consumption of the whole program and its procedures as functions on
input data sizes. We have tested our approach on (C-like) embedded programs
running on the XMOS hardware platform. However, our method is general enough to
be applied to other microprocessor architectures and programming languages. The
bounds obtained by our prototype implementation can be tight while remaining on
the safe side of budgets in practice, as shown by our experimental evaluation.
Comment: Pre-proceedings paper presented at the 27th International Symposium
on Logic-Based Program Synthesis and Transformation (LOPSTR 2017), Namur,
Belgium, 10-12 October 2017 (arXiv:1708.07854). Improved version of the one
presented at the HIP3ES 2016 workshop (v1): more experimental results (added
benchmark to Table 1, added figure for new benchmark, added Table 3),
improved Fig. 1, added Fig.
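The combination step in the energy-bounds abstract above (per-block minimum and maximum energies composed along the control flow into bounds that depend on input size) can be sketched on a small structured-program tree. This is a simplified illustration under stated assumptions, not the paper's static analysis: the node types, the composition rules (sum for sequences, min/max for branches, trip count times body for loops) and all energy values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

Bounds = Tuple[float, float]  # (lower, upper) energy, arbitrary units


@dataclass
class Block:
    """Basic block with per-block bounds (e.g. estimated by a search procedure)."""
    lower: float
    upper: float


@dataclass
class Seq:
    parts: List["Node"]


@dataclass
class Branch:
    then: "Node"
    orelse: "Node"


@dataclass
class Loop:
    body: "Node"
    trips: Callable[[int], int]  # iteration count as a function of input size


Node = Union[Block, Seq, Branch, Loop]


def energy_bounds(node: Node, n: int) -> Bounds:
    """Compose block-level bounds into whole-program bounds for input size n."""
    if isinstance(node, Block):
        return (node.lower, node.upper)
    if isinstance(node, Seq):
        pairs = [energy_bounds(part, n) for part in node.parts]
        return (sum(lo for lo, _ in pairs), sum(hi for _, hi in pairs))
    if isinstance(node, Branch):
        lo_t, hi_t = energy_bounds(node.then, n)
        lo_e, hi_e = energy_bounds(node.orelse, n)
        # Either arm may run: take the cheapest lower and costliest upper bound.
        return (min(lo_t, lo_e), max(hi_t, hi_e))
    if isinstance(node, Loop):
        lo, hi = energy_bounds(node.body, n)
        k = node.trips(n)
        return (k * lo, k * hi)
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Hypothetical program: a setup block followed by a loop over n elements
# whose body is a two-way branch.
program = Seq([
    Block(2.0, 3.0),
    Loop(Branch(Block(1.0, 1.5), Block(0.5, 4.0)), trips=lambda n: n),
])
print(energy_bounds(program, 10))  # (7.0, 43.0)
```

The sketch evaluates the bounds at a concrete input size; inferring closed-form bound functions, as the abstract describes, would require a symbolic treatment that is outside its scope.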
Gene Function Classification Using Bayesian Models with Hierarchy-Based Priors
We investigate the application of hierarchical classification schemes to the
annotation of gene function based on several characteristics of protein
sequences including phylogenic descriptors, sequence based attributes, and
predicted secondary structure. We discuss three Bayesian models and compare
their performance in terms of predictive accuracy. These models are the
ordinary multinomial logit (MNL) model, a hierarchical model based on a set of
nested MNL models, and a MNL model with a prior that introduces correlations
between the parameters for classes that are nearby in the hierarchy. We also
provide a new scheme for combining different sources of information. We use
these models to predict the functional class of Open Reading Frames (ORFs) from
the E. coli genome. The results from all three models show substantial
improvement over previous methods, which were based on the C5 algorithm. The
MNL model using a prior based on the hierarchy outperforms both the
non-hierarchical MNL model and the nested MNL model. In contrast to previous
attempts at combining these sources of information, our approach results in a
higher accuracy rate when compared to models that use each data source alone.
Together, these results show that gene function can be predicted with higher
accuracy than previously achieved, using Bayesian models that incorporate
suitable prior information.
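One way a prior can make the parameters of nearby classes correlated is to build each class's coefficients as a sum of independent Gaussian terms, one per branch on the path from the root of the hierarchy to that class. The sketch below illustrates that construction only; the abstract does not give the exact form of the prior, so the hierarchy, the branch names, and the shared variance `sigma` are assumptions.

```python
import random
from typing import Dict, List

# Hypothetical two-level hierarchy: each leaf class lists the branches on its
# path from the root.
PATHS: Dict[str, List[str]] = {
    "A1": ["A", "A1"],
    "A2": ["A", "A2"],
    "B1": ["B", "B1"],
}


def sample_class_coefficients(paths: Dict[str, List[str]], n_features: int,
                              sigma: float, rng: random.Random) -> Dict[str, List[float]]:
    """Draw one coefficient vector per class from the path-sum prior."""
    branches = sorted({b for path in paths.values() for b in path})
    # One independent Gaussian vector per branch of the hierarchy.
    phi = {b: [rng.gauss(0.0, sigma) for _ in range(n_features)] for b in branches}
    # A class's coefficients are the sum of the vectors along its path.
    return {
        cls: [sum(phi[b][j] for b in path) for j in range(n_features)]
        for cls, path in paths.items()
    }


def prior_covariance(paths: Dict[str, List[str]], cls_a: str, cls_b: str,
                     sigma: float) -> float:
    """Prior covariance of the same coefficient in two classes.

    With independent, equal-variance branch terms, this is the number of
    shared branches times sigma squared.
    """
    shared = set(paths[cls_a]) & set(paths[cls_b])
    return len(shared) * sigma ** 2


beta = sample_class_coefficients(PATHS, n_features=3, sigma=1.0, rng=random.Random(0))
print(prior_covariance(PATHS, "A1", "A2", 1.0))  # siblings share branch "A": 1.0
print(prior_covariance(PATHS, "A1", "B1", 1.0))  # no shared branch: 0.0
```

Under this construction, sibling classes share the term for their common parent and so have positively correlated coefficients a priori, while classes in different subtrees are independent.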
Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology
Predicting glaucomatous visual field deterioration through short multivariate time series modelling
In bio-medical domains there are many
applications involving the modelling of
multivariate time series (MTS) data. One area
that has been largely overlooked so far is the
particular type of time series where the data set
consists of a large number of variables but with
a small number of observations. In this paper we
describe the development of a novel computational
method based on genetic algorithms that bypasses
the size restrictions of traditional statistical
MTS methods, makes no distribution assumptions,
and also locates the order and associated
parameters as a whole step. We apply this method to the prediction and modelling of glaucomatous
visual field deterioration.
A quick guide for student-driven community genome annotation
High quality gene models are necessary to expand the molecular and genetic
tools available for a target organism, but these are available for only a
handful of model organisms that have undergone extensive curation and
experimental validation over the course of many years. The majority of gene
models present in biological databases today have been identified in draft
genome assemblies using automated annotation pipelines that are frequently
based on orthologs from distantly related model organisms. Manual curation is
time consuming and often requires substantial expertise, but is instrumental in
improving gene model structure and identification. Manual annotation may seem
to be a daunting and cost-prohibitive task for small research communities but
involving undergraduates in community genome annotation consortiums can be
mutually beneficial for both education and improved genomic resources. We
outline a workflow for efficient manual annotation driven by a team of
primarily undergraduate annotators. This model can be scaled to large teams and
includes quality control processes through incremental evaluation. Moreover, it
gives students an opportunity to increase their understanding of genome biology
and to participate in scientific research in collaboration with peers and
senior researchers at multiple institutions.