
    Large-scale automated protein function prediction

    Includes bibliographical references. 2016 Summer. Proteins are the workhorses of life, and identifying their functions is a very important biological problem. The function of a protein can be loosely defined as everything it performs or that happens to it. The Gene Ontology (GO) is a structured vocabulary which captures protein function in a hierarchical manner and contains thousands of terms. Through various wet-lab experiments over the years, scientists have been able to annotate a large number of proteins with GO categories which reflect their functionality. However, experimentally determining protein functions is a highly resource-intensive task, and a large fraction of proteins remain unannotated. Recently a plethora of automated methods have emerged, and their reasonable success in computationally determining the functions of proteins using a variety of data sources (sequence/structure similarity or various biological network data) has led to establishing automated function prediction (AFP) as an important problem in bioinformatics. In a typical machine learning problem, cross-validation is the protocol of choice for evaluating the accuracy of a classifier. But, due to the process of accumulation of annotations over time, we identify AFP as a combination of two sub-tasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. In our first project, we analyze the performance of several protein function prediction methods in these two scenarios. Our results show that GOstruct, an AFP method that our lab has previously developed, and two other popular methods, binary SVMs and guilt by association, find it hard to achieve the same level of accuracy on these two tasks compared to the performance evaluated through cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins.
We develop GOstruct 2.0 by proposing improvements which allow the model to make use of information about a protein's current annotations to better handle the task of predicting novel annotations for previously annotated proteins. Experimental results on yeast and human data show that GOstruct 2.0 outperforms the original GOstruct, demonstrating the effectiveness of the proposed improvements. Although the biomedical literature is a very informative resource for identifying protein function, most AFP methods do not take advantage of the large amount of information contained in it. In our second project, we conduct the first ever comprehensive evaluation of the effectiveness of literature data for AFP. Specifically, we extract co-mentions of protein-GO term pairs and bag-of-words features from the literature and explore their effectiveness in predicting protein function. Our results show that literature features are very informative of protein function but with further room for improvement. In order to improve the quality of automatically extracted co-mentions, we formulate the classification of co-mentions as a supervised learning problem and propose a novel method based on graph kernels. Experimental results indicate the feasibility of using this co-mention classifier as a complementary method that aids the bio-curators who are responsible for maintaining databases such as Gene Ontology. This is the first study of the problem of protein-function relation extraction from biomedical text. The recently developed human phenotype ontology (HPO), which is very similar to GO, is a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein-coding genes have HPO annotations. However, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods.
In our third project, we introduce PHENOstruct, a computational method that directly predicts the set of HPO terms for a given gene. We compare PHENOstruct with several baseline methods and show that it outperforms them in every respect. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large-scale literature mining data.

    MPRAP: An accessibility predictor for α-helical transmembrane proteins that performs well inside and outside the membrane

    Background: In water-soluble proteins it is energetically favorable to bury hydrophobic residues and to expose polar and charged residues. In contrast to water-soluble proteins, transmembrane proteins face three distinct environments: a hydrophobic lipid environment inside the membrane, a hydrophilic water environment outside the membrane, and an interface region rich in phospholipid head-groups. Therefore, it is energetically favorable for transmembrane proteins to expose different types of residues in the different regions. Results: Investigations of a set of structurally determined transmembrane proteins showed that the composition of solvent-exposed residues differs significantly inside and outside the membrane. In contrast, residues buried within the interior of a protein show a much smaller difference. However, in all regions exposed residues are less conserved than buried residues. Further, we found that current state-of-the-art predictors for surface area are optimized for one of the regions and perform badly in the other regions. To circumvent this limitation we developed a new predictor, MPRAP, that performs well in all regions. In addition, MPRAP performs better on complete membrane proteins than a combination of specialized predictors and acceptably on water-soluble proteins. A web server for MPRAP is available at http://mprap.cbr.su.se/ Conclusion: By including complete α-helical transmembrane proteins in the training, MPRAP is able to predict surface accessibility accurately both inside and outside the membrane. This predictor can aid in the prediction of 3D structure and in the identification of erroneous protein structures.

    Leveraging expression and network data for protein function prediction

    2012 Summer. Includes bibliographical references. Protein function prediction is one of the prominent problems in bioinformatics today. Protein annotation is slowly falling behind as more and more genomes are being sequenced. Experimental methods are expensive and time consuming, which leaves computational methods to fill the gap. While computational methods are still not accurate enough to be used without human supervision, this is the goal. The Gene Ontology (GO) is a collection of terms that are the standard for protein function annotations. Because of the structure of GO, protein function prediction is a hierarchical multi-label classification problem. The classification method used in this thesis is GOstruct, which performs structured predictions that take into account all GO terms. GOstruct has been shown to work well, but there are still improvements to be made. In this thesis, I work to improve predictions by building new kernels from the data that are used by GOstruct. To do this, I find key representations of the data that help define what kernels perform best on the variety of data types. I apply this methodology to function prediction in two model organisms, Saccharomyces cerevisiae and Mus musculus, and find better methods for interpreting the data.

    Functional and structural analysis of FST1 in Fusarium verticillioides

    Fusarium verticillioides causes an important seed disease on maize and produces fumonisin B1 (FB1), a mycotoxin that is detrimental to human and animal health. Previous studies discovered that expression of FST1 is required for FB1 production and wild-type level of virulence on maize seeds. FST1 encodes a putative protein with 12 transmembrane domains with sequence similarity to hexose transporters. However, those studies have failed to prove its ability to transport glucose, fructose or mannose. I identified three additional phenotypes associated with the lack of a functional FST1: reduced hydrophobicity of hyphae, reduced macroconidia production, and increased sensitivity to hydrogen peroxide. My research compared the transcriptome of the wild type and strain Δfst1 when grown on autoclaved maize kernels. Seventeen percent of the transcriptome (2677 genes) was differentially expressed. Examination of these genes indicated that the disruption of FST1 function affected genes involved in secondary metabolism, cell structure, conidiogenesis, virulence, and resistance to reactive oxygen species. Additionally, I used a Saccharomyces cerevisiae strain (Δitr1) lacking a functional inositol transporter gene (ITR1) to study the function of FST1. This yeast mutant grows poorly in myo-inositol medium and is not inhibited by FB1. I found that expression of FST1 in strain Δitr1 restored growth on myo-inositol medium and sensitivity to FB1 to levels observed in the wild-type yeast strain. The results indicate that FST1 can function as an inositol transporter and suggest it can transport FB1 into fungal cells. Finally, the functional importance of amino acids in FST1 was examined by creating targeted mutations in the central loop and C-terminus regions of the protein. Expression of these engineered FST1 genes in strain Δitr1 of S. cerevisiae and strain Δfst1 of F. verticillioides indicated that both the central loop and C-terminus are critical for FST1 functionality.
Overall, this research has established the first characterized inositol transporter in filamentous fungi and has advanced our knowledge about the global regulatory functions of FST1.

    The Role of Intracellular Interactions in the Collective Polarization of Tissues and its Interplay with Cellular Geometry

    Planar cell polarity (PCP), the coherent in-plane polarization of a tissue on multicellular length scales, provides directional information that guides a multitude of developmental processes at cellular and tissue levels. While it is manifest that cells utilize both intracellular and intercellular mechanisms, how the two produce the collective polarization remains an active area of investigation. We study the role of intracellular interactions in the large-scale spatial coherence of cell polarities and in the emergence of tissue-wide polarization. We demonstrate that nonlocal cytoplasmic interactions are necessary and sufficient for robust long-range polarization, and are essential to the faithful detection of weak directional signals. In the presence of nonlocal interactions, signatures of geometrical information in tissue polarity become manifest. We investigate the deleterious effects of geometric disorder, and determine conditions on the cytoplasmic interactions that guarantee the stability of polarization. These conditions become progressively more stringent as the geometric disorder increases. Another situation where the role of geometrical information might be evident is elongated tissues. Strikingly, our model recapitulates an observed influence of tissue elongation on the orientation of polarity. Finally, we introduce three classes of mutants: lack of membrane proteins, lack of cytoplasmic proteins, and local geometrical irregularities. We adopt core-PCP as a model pathway, and interpret the model parameters accordingly by comparing the in silico and in vivo phenotypes. This comparison helps us shed light on the roles of the cytoplasmic proteins in cell-cell communication, and make predictions regarding the cooperation of cytoplasmic and membrane proteins in long-range polarization. Comment: 15 pages Main Text + 8 page Appendi

    FFPred: an integrated feature-based function prediction server for vertebrate proteomes

    One of the challenges of the post-genomic era is to provide accurate function annotations for large volumes of data resulting from genome sequencing projects. Most function prediction servers utilize methods that transfer existing database annotations between orthologous sequences. In contrast, there are few methods that are independent of homology and can annotate distant and orphan protein sequences. The FFPred server adopts a machine-learning approach to perform function prediction in protein feature space using feature characteristics predicted from amino acid sequence. The features are scanned against a library of support vector machines representing over 300 Gene Ontology (GO) classes, and probabilistic confidence scores are returned for each annotation term. The GO term library has been modelled on human protein annotations; however, benchmark performance testing showed robust performance across higher eukaryotes. FFPred offers important advantages over traditional function prediction servers in its ability to annotate distant homologues and orphan protein sequences, and achieves greater coverage and classification accuracy than other feature-based prediction servers. A user may upload an amino acid sequence and receive annotation predictions via email. Feature information is provided as easy-to-interpret graphics displayed on the sequence of interest, allowing for back-interpretation of the associations between features and function classes.