59,145 research outputs found

    A Factor Graph Approach to Automated GO Annotation

    As the volume of genomic data grows, computational methods become essential for providing a first glimpse of gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with a main focus on annotation precision, and heuristic alternatives, with a main focus on scalability issues, have been described in the literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of the most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.
    Authors: Spetale, Flavio Ezequiel; Krsticevic, Flavia Jorgelina; Roda, Fernando; Bulacio, Pilar Estela. Affiliations (as listed for each author): Consejo Nacional de Investigaciones Científicas y Técnicas, Centro Científico Tecnológico Conicet - Rosario, Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Universidad Nacional de Rosario, Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina.
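
The message-passing idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the four-term hierarchy, the raw scores, and the function names are all hypothetical, and the graph is simplified to pairwise consistency factors on a tree (where sum-product is exact). Each term gets an evidence factor from its raw classifier score, and each child-parent edge gets a factor that forbids an annotated child under an unannotated parent.

```python
from itertools import product

# Hypothetical toy hierarchy of child -> parent ("is_a") edges; not real GO terms.
EDGES = [("B", "A"), ("C", "A"), ("A", "root")]
# Hypothetical raw classifier scores, read as P(term = 1) before any consistency check.
SCORES = {"root": 0.9, "A": 0.4, "B": 0.8, "C": 0.1}


def unary(term, x):
    """Evidence factor: the raw score if the term is on, its complement if off."""
    return SCORES[term] if x == 1 else 1.0 - SCORES[term]


def pair(child_x, parent_x):
    """Consistency factor: zero weight for an annotated child under an unannotated parent."""
    return 0.0 if (child_x == 1 and parent_x == 0) else 1.0


def edge_factor(src, dst, x_src, x_dst):
    """Evaluate the consistency factor whichever way the message travels."""
    if (src, dst) in EDGES:           # src is the child of dst
        return pair(x_src, x_dst)
    return pair(x_dst, x_src)         # src is the parent of dst


def sum_product(n_iters=10):
    """Return P(term = 1) for every term after n_iters rounds of message updates."""
    neighbours = {t: [] for t in SCORES}
    for child, parent in EDGES:
        neighbours[child].append(parent)
        neighbours[parent].append(child)
    # msgs[(i, j)][x] is the message from term i to term j about state x of j.
    msgs = {(i, j): [1.0, 1.0] for i in neighbours for j in neighbours[i]}
    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:
            out = [0.0, 0.0]
            for x_i, x_j in product((0, 1), repeat=2):
                incoming = 1.0
                for k in neighbours[i]:
                    if k != j:
                        incoming *= msgs[(k, i)][x_i]
                out[x_j] += unary(i, x_i) * edge_factor(i, j, x_i, x_j) * incoming
            total = out[0] + out[1]
            new[(i, j)] = [out[0] / total, out[1] / total]
        msgs = new
    marginals = {}
    for t in neighbours:
        belief = [unary(t, x) for x in (0, 1)]
        for k in neighbours[t]:
            belief = [belief[x] * msgs[(k, t)][x] for x in (0, 1)]
        marginals[t] = belief[1] / (belief[0] + belief[1])
    return marginals


if __name__ == "__main__":
    for term, p in sum_product().items():
        print(term, round(p, 3))
```

In this toy setting the weak raw score for term A is pulled up by its confidently predicted child B, and the resulting marginals never decrease when moving from a child to its parent. On the real GO graph, where terms can have several parents, the same updates would run on a graph with loops and would only approximate the marginals.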

    CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures

    We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.

    Machine learning applied to enzyme turnover numbers reveals protein structural correlates and improves metabolic models.

    Knowing the catalytic turnover numbers of enzymes is essential for understanding the growth rate, proteome composition, and physiology of organisms, but experimental data on enzyme turnover numbers is sparse and noisy. Here, we demonstrate that machine learning can successfully predict catalytic turnover numbers in Escherichia coli based on integrated data on enzyme biochemistry, protein structure, and network context. We identify a diverse set of features that are consistently predictive for both in vivo and in vitro enzyme turnover rates, revealing novel protein structural correlates of catalytic turnover. We use our predictions to parameterize two mechanistic genome-scale modelling frameworks for proteome-limited metabolism, leading to significantly higher accuracy in the prediction of quantitative proteome data than previous approaches. The presented machine learning models thus provide a valuable tool for understanding metabolism and the proteome at the genome scale, and elucidate structural, biochemical, and network properties that underlie enzyme kinetics.

    Assessing Protein Conformational Sampling Methods Based on Bivariate Lag-Distributions of Backbone Angles

    Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been extensively studied recently due to their ability to capture the continuous conformational space of protein structures. The literature has focused on using a variety of parametric models of the sequential dependencies between angle pairs along the protein chains. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type to model the protein angles? What is a reasonable number of components in a mixture model that should be considered to accurately parameterize the joint distribution of the angles? and What is the order of the local sequence–structure dependency that should be considered by a prediction method? We assess the model fits for different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across the lags can be extracted using a technique called Lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distribution of the angles using singular value decomposition. As a result, we developed graphical tools and numerical measurements to compare and evaluate the performance of different model fits. Furthermore, we developed a web-tool (http://www.stat.tamu.edu/~madoliat/LagSVD) that can be used to produce informative animations.
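
As a rough illustration of the lag-distributions mentioned in the abstract above, the sketch below builds an empirical joint histogram of one backbone angle at residue i and another at residue i + lag. This reading of "bivariate lag-distribution" is an assumption, the function name and binning scheme are hypothetical, and the SVD step of LagSVD is not shown.

```python
import math


def lag_histogram(phi, psi, lag, n_bins=12):
    """Empirical joint distribution of (phi[i], psi[i + lag]) along one chain.

    Angles are in radians; values outside [-pi, pi) are wrapped into that range.
    Returns an n_bins x n_bins table of relative frequencies that sums to 1.
    """
    if len(phi) != len(psi):
        raise ValueError("phi and psi must have the same length")
    n_pairs = len(phi) - lag
    if lag < 0 or n_pairs <= 0:
        raise ValueError("lag must be non-negative and shorter than the chain")

    width = 2 * math.pi / n_bins

    def bin_of(angle):
        # Shift to [0, 2*pi) and guard against rounding at the upper edge.
        wrapped = (angle + math.pi) % (2 * math.pi)
        return min(int(wrapped / width), n_bins - 1)

    counts = [[0] * n_bins for _ in range(n_bins)]
    for i in range(n_pairs):
        counts[bin_of(phi[i])][bin_of(psi[i + lag])] += 1
    return [[c / n_pairs for c in row] for row in counts]


if __name__ == "__main__":
    # Made-up angles for a four-residue chain, purely for illustration.
    phi = [-3.0, 0.1, 2.0, -1.0]
    psi = [0.5, 0.2, -2.0, 3.0]
    for row in lag_histogram(phi, psi, lag=1, n_bins=4):
        print(row)
```

Computing one such table per lag for both the observed structures and samples drawn from a fitted model gives a simple way to compare them lag by lag.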

    Structural analysis of the adenovirus type 2 E3/19K protein using mutagenesis and a panel of conformation-sensitive monoclonal antibodies

    The E3/19K protein of human adenovirus type 2 (Ad2) was the first viral protein shown to interfere with antigen presentation. This 25 kDa transmembrane glycoprotein binds to major histocompatibility complex (MHC) class I molecules in the endoplasmic reticulum (ER), thereby preventing transport of newly synthesized peptide–MHC complexes to the cell surface and consequently T cell recognition. Recent data suggest that E3/19K also sequesters MHC class I-like ligands intracellularly to suppress natural killer (NK) cell recognition. While the mechanism of ER retention is well understood, the structure of E3/19K remains elusive. To further dissect the structural and antigenic topography of E3/19K, we carried out site-directed mutagenesis and raised monoclonal antibodies (mAbs) against a recombinant version of Ad2 E3/19K comprising the lumenal domain followed by a C-terminal histidine tag. Using peptide scanning, the epitopes of three mAbs were mapped to different regions of the lumenal domain, comprising amino acids 3–13, 15–21 and 41–45, respectively. Interestingly, mAb 3F4 reacted only weakly with wild-type E3/19K, but showed drastically increased binding to mutant E3/19K molecules, e.g. those with disrupted disulfide bonds, suggesting that 3F4 can sense unfolding of the protein. MAb 10A2 binds to an epitope apparently buried within E3/19K while that of 3A9 is exposed. Secondary structure prediction suggests that the lumenal domain contains six β-strands and an α-helix adjacent to the transmembrane domain. Interestingly, all mAbs bind to non-structured loops. Using a large panel of E3/19K mutants, the structural alterations of the mutations were determined. With this knowledge, the panel of mAbs will be valuable tools to further dissect structure/function relationships of E3/19K regarding downregulation of MHC class I and MHC class I-like molecules and its effect on both T cell and NK cell recognition.

    101 Dothideomycetes genomes: A test case for predicting lifestyles and emergence of pathogens.

    Dothideomycetes is the largest class of kingdom Fungi and comprises an incredible diversity of lifestyles, many of which have evolved multiple times. Plant pathogens represent a major ecological niche of the class Dothideomycetes and they are known to infect most major food crops and feedstocks for biomass and biofuel production. Studying the ecology and evolution of Dothideomycetes has significant implications for our fundamental understanding of fungal evolution, their adaptation to stress and host specificity, and practical implications with regard to the effects of climate change and on the food, feed, and livestock elements of the agro-economy. In this study, we present the first large-scale, whole-genome comparison of 101 Dothideomycetes introducing 55 newly sequenced species. The availability of whole-genome data produced a high-confidence phylogeny leading to reclassification of 25 organisms, provided a clearer picture of the relationships among the various families, and indicated that pathogenicity evolved multiple times within this class. We also identified gene family expansions and contractions across the Dothideomycetes phylogeny linked to ecological niches providing insights into genome evolution and adaptation across this group. Using machine-learning methods we classified fungi into lifestyle classes with >95% accuracy and identified a small number of gene families that positively correlated with these distinctions. This can become a valuable tool for genome-based prediction of species lifestyle, especially for rarely seen and poorly studied species.