
    ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network

    With the development of next-generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from them, because of the limitations of traditional biological experimental techniques. Protein function prediction has been a long-standing challenge in filling the gap between the huge number of protein sequences and the known functions. In this paper, we propose a novel method that converts the protein function prediction problem into a language translation problem, from the newly proposed protein sequence language "ProLan" to the protein function language "GOLan", and build a neural machine translation model based on recurrent neural networks to translate "ProLan" into "GOLan". We blindly tested our method by participating in the third Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluated its performance on selected proteins whose functions were released after the CAFA competition. The good performance on the training and testing datasets demonstrates that our proposed method is a promising direction for protein function prediction. In summary, we propose, for the first time, a method that converts the protein function prediction problem into a language translation problem and applies a neural machine translation model to protein function prediction. Comment: 13 pages, 5 figures

    Interaction site prediction by structural similarity to neighboring clusters in protein-protein interaction networks

    Background: Revealing the function of proteins with protein-protein interaction (PPI) networks has recently come to be regarded as one of the important issues in bioinformatics. With the development of experimental methods such as the yeast two-hybrid method, the amount of protein interaction data has been increasing rapidly. Many databases that handle these data comprehensively have been constructed and applied to the analysis of PPI networks. However, little research has explored the prediction of interaction sites using PPI networks and 3D protein structures in a complementary way. Results: We propose a method for predicting interaction sites in proteins of unknown function by using both PPI networks and protein structures. For a target protein of unknown function, several clusters are extracted from its neighboring proteins based on their structural similarity. Interaction sites are then predicted by extracting similar sites from the group formed by a protein cluster and the target protein. Moreover, the proposed method can improve prediction accuracy by introducing a repetitive prediction process. Conclusions: The proposed method has been applied to a small-scale dataset, and its effectiveness has been confirmed. The challenge will now be to apply the method to large-scale datasets.

    K-Partite cliques of protein interactions: A novel subgraph topology for functional coherence analysis on PPI networks

    Many studies aim to identify dense clusters/subgraphs in protein-protein interaction (PPI) networks for protein function prediction. However, the prediction performance based on dense clusters is actually worse than that of a simple guilt-by-association method based on neighbor counting. This indicates that the local topological structures and properties of PPI networks are still open to new theoretical investigation and empirical exploration. To study PPI networks, we introduce a novel topological structure called k-partite cliques of protein interactions, a functionally coherent but not necessarily dense subgraph topology. A k-partite protein clique is a maximal k-partite clique comprising two or more nonoverlapping protein subsets, between any two of which full interactions are exhibited. To detect maximal k-partite cliques in a PPI network, we propose transforming the network into induced K-partite graphs in which edges exist only between the partites. We then present a maximal k-partite clique mining (MaCMik) algorithm to enumerate maximal k-partite cliques from K-partite graphs. Our MaCMik algorithm is then applied to a yeast PPI network. We observed interesting and unusually high functional coherence in k-partite protein cliques: the majority of the proteins in k-partite protein cliques, especially those in the same partites, share the same functions, even though k-partite protein cliques are not required to be dense, unlike dense subgraph patterns or (quasi-)cliques. The idea of k-partite protein cliques provides a novel approach to characterizing PPI networks, and so it will help function prediction for unknown proteins. © 2013 Elsevier Ltd
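The definition in this abstract (two or more nonoverlapping protein subsets with full interactions between any two of them) can be illustrated with a small check. This is only a minimal sketch under stated assumptions: the function name `is_k_partite_clique`, the toy proteins, and the edge list are hypothetical, and the code neither tests maximality nor reproduces the MaCMik enumeration algorithm.

```python
from itertools import combinations


def is_k_partite_clique(partites, edges):
    """Check the 'full interactions between any two subsets' property.

    partites: list of sets of protein identifiers (hypothetical toy input).
    edges: iterable of (protein, protein) pairs, treated as undirected.
    """
    # Store each interaction as an unordered pair.
    edge_set = {frozenset(e) for e in edges}

    # The definition requires two or more non-empty subsets.
    if len(partites) < 2 or any(len(p) == 0 for p in partites):
        return False

    for a, b in combinations(partites, 2):
        # The subsets must not overlap.
        if set(a) & set(b):
            return False
        # Every protein in one subset must interact with every protein
        # in the other subset.
        for u in a:
            for v in b:
                if frozenset((u, v)) not in edge_set:
                    return False
    return True


# Toy network: A and B each interact with both C and D.
edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]
print(is_k_partite_clique([{"A", "B"}, {"C", "D"}], edges))
print(is_k_partite_clique([{"A", "B"}, {"C", "E"}], edges))
```

Note that the check places no requirement on interactions inside a subset, which matches the abstract's point that such structures need not be dense.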

    Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks

    Complex biological systems have been successfully modeled by biochemical and genetic interaction networks, typically gathered from high-throughput (HTP) data. These networks can be used to infer functional relationships between genes or proteins. Using the intuition that the topological role of a gene in a network relates to its biological function, local or diffusion-based "guilt-by-association" and graph-theoretic methods have had success in inferring gene functions. Here we seek to improve function prediction by integrating diffusion-based methods with a novel dimensionality reduction technique to overcome the incomplete and noisy nature of network data. In this paper, we introduce diffusion component analysis (DCA), a framework that plugs in a diffusion model and learns a low-dimensional vector representation of each node to encode the topological properties of a network. As a proof of concept, we demonstrate DCA's substantial improvement over state-of-the-art diffusion-based approaches in predicting protein function from molecular interaction networks. Moreover, our DCA framework can integrate multiple networks from heterogeneous sources, consisting of genomic information, biochemical experiments and other resources, to improve function prediction even further. Yet another layer of performance gain is achieved by integrating the DCA framework with support vector machines that take our node vector representations as features. Overall, our DCA framework provides a novel representation of nodes in a network that can be used as a plug-in architecture for other machine learning algorithms to decipher topological properties of, and obtain novel insights into, interactomes. Comment: RECOMB 201
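The abstract does not say which diffusion model is plugged in, so the following is only a sketch of one common choice, random walk with restart, on an unweighted network. The function name `random_walk_with_restart`, the `restart` and `iters` parameters, and the toy network are all assumptions made for illustration; the dimensionality reduction step that DCA adds on top of the diffusion is not shown.

```python
def random_walk_with_restart(adj, start, restart=0.5, iters=100):
    """Approximate the stationary distribution of a walk restarting at `start`.

    adj: dict mapping each node to a list of its neighbours (hypothetical
         toy representation of an interaction network).
    restart: probability of jumping back to `start` at each step.
    """
    nodes = sorted(adj)
    # Begin with all probability mass on the start node.
    p = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iters):
        # The restart mass always goes to the start node.
        nxt = {n: (restart if n == start else 0.0) for n in nodes}
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:
                # Isolated node: return its walking mass to the start node
                # so that the total probability stays at 1.
                nxt[start] += (1 - restart) * p[u]
                continue
            # Spread the remaining mass evenly over the neighbours.
            share = (1 - restart) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        p = nxt
    return p


# Toy path network A - B - C, diffusing from B.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
scores = random_walk_with_restart(adj, "B")
print(scores)
```

Each node's resulting distribution can be read as a description of its topological context, which is the kind of input a low-dimensional embedding step would then compress.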

    Ligand Similarity Complements Sequence, Physical Interaction, and Co-Expression for Gene Function Prediction

    The expansion of protein-ligand annotation databases has enabled large-scale networking of proteins by ligand similarity. These ligand-based protein networks, which implicitly predict the ability of neighboring proteins to bind related ligands, may complement biologically oriented gene networks, which are used to predict functional or disease relevance. To quantify the degree to which such ligand-based protein associations might complement functional genomic associations, including sequence similarity, physical protein-protein interactions, co-expression, and disease gene annotations, we calculated a network based on the Similarity Ensemble Approach (SEA: sea.docking.org), where protein neighbors reflect the similarity of their ligands. We also measured the similarity with functional genomic networks over a common set of 1,131 genes, and found that the networks had only small overlaps, which were significant only due to the large scale of the data. Consistent with the view that the networks contain different information, combining them substantially improved Molecular Function prediction within GO (from AUROC ~0.63-0.75 for the individual data modalities to AUROC ~0.8 in the aggregate). We investigated the boost in guilt-by-association gene function prediction when the networks are combined and describe underlying properties that can be further exploited.

    A biophysical approach to large-scale protein-DNA binding data

    Cutting-edge genome analysis methods from leading bioinformaticians. An accurate description of current scientific developments in the field of bioinformatics and computational implementation is presented through research of the BioSapiens Network of Excellence. Bioinformatics is essential for annotating the structure and function of genes and proteins, for the analysis of complete genomes, and for molecular biology and biochemistry. Included is an overview of bioinformatics and the full spectrum of genome annotation approaches, including: genome analysis and gene prediction, gene regulation analysis and expression, genome variation and QTL analysis, large-scale protein annotation of function and structure, annotation and prediction of protein interactions, and the organization and annotation of molecular networks and biochemical pathways. Also covered are a technical framework to organize and represent genome data using the DAS technology, and work on the annotation of two large genomic sets: HIV/HCV viral genomes and splicing alternatives potentially encoded in 1% of the human genome.

    Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data

    Inference of protein functions is one of the most important aims of modern biology. To fully exploit the large volumes of genomic data typically produced in modern-day genomic experiments, automated computational methods for protein function prediction are urgently needed. Established methods use sequence or structure similarity to infer functions, but those types of data do not suffice to determine the biological context in which proteins act. Current high-throughput biological experiments produce large amounts of data on the interactions between proteins. Such data can be used to infer interaction networks and to predict the biological process in which a protein is involved. Here, we develop a probabilistic approach for protein function prediction using network data, such as protein-protein interaction measurements. We take a Bayesian approach to an existing Markov Random Field method by performing simultaneous estimation of the model parameters and prediction of protein functions. We use an adaptive Markov Chain Monte Carlo algorithm that leads to more accurate parameter estimates and consequently to improved prediction performance compared to the standard Markov Random Field method. We tested our method using a high-quality S. cerevisiae validation network with 1622 proteins against 90 Gene Ontology terms of different levels of abstraction. Compared to three other protein function prediction methods, our approach shows very good prediction performance. Our method can be directly applied to protein-protein interaction or coexpression networks, and can also be extended to use multiple data sources. We apply our method to physical protein interaction data from S. cerevisiae and provide novel predictions, using 340 Gene Ontology terms, for 1170 unannotated proteins, and we evaluate the predictions using the available literature.

    The topology of the bacterial co-conserved protein network and its implications for predicting protein function

    Background: Protein-protein interaction networks are most often generated from physical protein-protein interaction data. Co-conservation, also known as phylogenetic profiles, is an alternative source of information for generating protein interaction networks. Co-conservation methods generate interaction networks among proteins that are gained or lost together through evolution. Co-conservation is a particularly useful technique in compact bacterial genomes. Prior studies in yeast suggest that the topology of protein-protein interaction networks generated from physical interaction assays can offer important insight into protein function. Here, we hypothesize that in bacteria, the topology of protein interaction networks derived from co-conservation information could similarly improve methods for predicting protein function. Since the topology of bacterial co-conservation protein-protein interaction networks has not previously been studied in depth, we first perform such an analysis for co-conservation networks in E. coli K12. Next, we demonstrate one way in which network connectivity measures and global and local function distribution can be exploited to predict protein function for previously uncharacterized proteins. Results: Our results showed that, like most biological networks, our bacterial co-conserved protein-protein interaction networks had scale-free topologies. Our results indicated that some properties of the physical yeast interaction network hold in our bacterial co-conservation networks, such as high connectivity for essential proteins. However, the high connectivity among protein complexes in the yeast physical network was not seen in the co-conservation network, which uses all bacteria as the reference set. We found that the distribution of node connectivity varied by functional category and could be informative for function prediction. By integrating functional information from different annotation sources and using the network topology, we were able to infer function for uncharacterized proteins. Conclusion: Interaction networks based on co-conservation can contain information distinct from networks based on physical or other interaction types. Our study has shown that co-conservation-based networks exhibit a scale-free topology, as expected for biological networks. We also revealed ways in which connectivity in our networks can be informative for the functional characterization of proteins.