349 research outputs found

    ProtNN: Fast and Accurate Nearest Neighbor Protein Function Prediction based on Graph Embedding in Structural and Topological Space

    Studying the function of proteins is important for understanding the molecular mechanisms of life. The number of publicly available protein structures has grown extremely large. Still, determining the function of a protein structure remains a difficult, costly, and time-consuming task. The difficulties are often due to the essential role of spatial and topological structures in determining protein functions in living cells. In this paper, we propose ProtNN, a novel approach for protein function prediction. Given an unannotated query protein structure and a set of annotated reference proteins, ProtNN finds the nearest-neighbor reference proteins using a graph representation model and a pairwise similarity between vector embeddings of the query and reference protein graphs in structural and topological spaces. ProtNN assigns to the query protein the function with the highest number of votes across the set of k nearest-neighbor reference proteins, where k is a user-defined parameter. Experimental evaluation demonstrates that ProtNN classifies several datasets accurately and with a much shorter runtime than state-of-the-art approaches. We further show that ProtNN scales up to a whole PDB dataset in single-process mode with no parallelization, with a runtime gain on the order of thousands of times compared to state-of-the-art approaches.
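    The voting step described in this abstract can be illustrated with a minimal sketch. It assumes each protein graph has already been embedded as a fixed-length numeric vector; the function names, the Euclidean distance, and the toy vectors and labels are hypothetical illustrations, not ProtNN's actual embedding or similarity measure.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, references, k=3):
    """Return the majority label among the k references closest to query.

    references is a list of (embedding, label) pairs; the embeddings are
    assumed to be precomputed graph-attribute vectors (hypothetical here).
    """
    nearest = sorted(references, key=lambda ref: euclidean(query, ref[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy reference set: made-up 3-dimensional embeddings with made-up labels.
refs = [
    ([0.10, 4.0, 0.30], "kinase"),
    ([0.12, 4.0, 0.28], "kinase"),
    ([0.50, 2.0, 0.90], "protease"),
    ([0.48, 2.0, 0.85], "protease"),
    ([0.11, 5.0, 0.31], "kinase"),
]

predicted = knn_predict([0.11, 4.0, 0.29], refs, k=3)
```

    Because the query is compared with every reference by a plain vector distance, no graph isomorphism test is needed at prediction time, which is consistent with the runtime advantage the abstract reports.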

    Towards an Efficient Discovery of the Topological Representative Subgraphs

    With the emergence of graph databases, the task of frequent subgraph discovery has been extensively addressed. Although the approaches proposed in the literature have made this task feasible, the number of discovered frequent subgraphs is still too high to be used efficiently in any further exploration. Feature selection for graph data is a way to reduce the high number of frequent subgraphs based on exact or approximate structural similarity. However, current structural similarity strategies are not efficient enough in many real-world applications, and the combinatorial nature of graphs makes them computationally very costly. In order to select a smaller yet structurally non-redundant set of subgraphs, we propose a novel approach that mines the top-k topological representative subgraphs among the frequent ones. Our approach detects hidden structural similarities that existing approaches are unable to detect, such as the density or the diameter of the subgraph. In addition, it can easily be extended with any user-defined structural or topological attributes, depending on the sought properties. Empirical studies on real and synthetic graph datasets show that our approach is fast and scalable.
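    The idea of comparing subgraphs through topological attributes rather than isomorphism tests can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: each subgraph is described only by its density and diameter, and representatives are picked with a simple greedy farthest-point rule that stands in for whatever selection procedure the authors use. All function names and the toy graphs are hypothetical.

```python
from collections import deque


def density(adj):
    """Edge density of an undirected graph given as {node: set(neighbors)}."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))


def diameter(adj):
    """Longest shortest path, via BFS from every node (connected graph assumed)."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def describe(adj):
    """Topological description vector: (density, diameter)."""
    return (density(adj), float(diameter(adj)))


def pick_representatives(graphs, k):
    """Greedily pick k graphs whose description vectors are far apart."""
    vecs = {name: describe(adj) for name, adj in graphs.items()}
    names = sorted(vecs)
    chosen = [names[0]]
    while len(chosen) < min(k, len(names)):
        def gap(name):
            # Squared distance to the closest already-chosen representative.
            return min(
                sum((a - b) ** 2 for a, b in zip(vecs[name], vecs[c]))
                for c in chosen
            )
        chosen.append(max((n for n in names if n not in chosen), key=gap))
    return chosen


# Toy "frequent subgraphs": two paths and a triangle.
graphs = {
    "path3": {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}},
    "path4": {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}},
    "triangle": {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}},
}
representatives = pick_representatives(graphs, 2)
```

    The attributes are left unnormalized here for brevity, so the diameter dominates the distance; a real pipeline would likely rescale each attribute, and further user-defined attributes could be appended to the tuple returned by `describe`.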

    Discovery of Functional Motifs from the Interface Region of Oligomeric Proteins using Frequent Subgraph Mining

    Modeling the interface region of a protein complex paves the way for understanding its dynamics and functionalities. Existing works model the interface region of a complex using different approaches, such as the residue composition at the interface region, the geometry of the interface residues, or the structural alignment of interface regions. These approaches are useful for ranking a set of docked conformations or for building scoring functions for protein-protein docking, but they do not provide a generic and scalable technique for extracting interface patterns that lead to functional motif discovery. In this work, we model the interface region of a protein complex as graphs and extract interface patterns of the given complex in the form of frequent subgraphs. To achieve this, we develop a scalable algorithm for frequent subgraph mining. We show that a systematic review of the mined subgraphs provides an effective method for discovering functional motifs that exist along the interface region of a given protein complex.

    Towards an Efficient Discovery of Topological Representative Subgraphs

    National audience. Pattern selection based on exact or approximate structural similarity is a way to reduce the large number of frequent subgraphs. However, current structural similarity strategies are not effective in many real-world contexts. Moreover, the combinatorial nature of graphs makes exact or approximate isomorphism very costly. In this paper, we propose an approach that selects a subset of representative topological subgraphs from among the frequent ones. The proposed approach avoids the costly exact or approximate isomorphism test by measuring overall structural similarity based on a set of chosen topological attributes. It also detects hidden structural similarities (such as density, diameter, etc.) that existing approaches do not consider. Furthermore, the proposed approach is flexible and can easily be extended with user-defined attributes depending on the application. Experimental analyses on real and synthetic graph databases demonstrate the effectiveness of our approach.

    Predicting conserved protein motifs with Sub-HMMs

    Background: Profile HMMs (hidden Markov models) provide effective methods for modeling the conserved regions of protein families. A limitation of the resulting domain models is the difficulty of pinpointing their much shorter functional sub-features, such as catalytically relevant sequence motifs in enzymes or ligand binding signatures of receptor proteins. Results: To identify these conserved motifs efficiently, we propose a method for extracting the most information-rich regions in protein families from their profile HMMs. The method was used here to predict a comprehensive set of sub-HMMs from the Pfam domain database. Cross-validations with the PROSITE and CSA databases confirmed the efficiency of the method in predicting most of the known functionally relevant motifs and residues. At the same time, 46,768 novel conserved regions could be predicted. The data set also allowed us to link at least 461 Pfam domains of known and unknown function by their common sub-HMMs. Finally, the sub-HMM method showed very promising results as an alternative search method for identifying proteins that share only short sequence similarities. Conclusions: Sub-HMMs extend the application spectrum of profile HMMs to motif discovery. Their most interesting utility is the identification of the functionally relevant residues in proteins of known and unknown function. Additionally, sub-HMMs can be used for highly localized sequence similarity searches that focus on shorter conserved features rather than entire domains or global similarities. The motif data generated by this study is a valuable knowledge resource for characterizing protein functions in the future.
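    One way to picture "extracting the most information-rich regions" from a profile is sketched below. It is not the authors' procedure: it assumes each match state is given as a dictionary of amino-acid emission probabilities, scores each column by its information content against a uniform 20-letter background, and reports contiguous runs of high-scoring columns. The threshold, minimum length, function names, and toy profile are all hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_information(probs, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return math.log2(alphabet_size) - entropy


def informative_regions(profile, threshold=2.0, min_len=2):
    """Return half-open (start, end) column ranges whose columns all score
    at least threshold bits and that span at least min_len columns."""
    regions, start = [], None
    # A trailing None sentinel closes a run that reaches the last column.
    for i, col in enumerate(profile + [None]):
        high = col is not None and column_information(col) >= threshold
        if high and start is None:
            start = i
        elif not high and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


# Toy profile: uninformative columns around a short conserved stretch.
uniform = {aa: 0.05 for aa in AMINO_ACIDS}
profile = [
    uniform,
    {"G": 1.0},
    {"K": 0.5, "R": 0.5},
    {"S": 1.0},
    uniform,
    {"D": 1.0},
    uniform,
]
regions = informative_regions(profile)
```

    In this toy profile, columns 1 to 3 form one region, while the isolated conserved column 5 is dropped because it is shorter than `min_len`.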

    Latent Representation and Sampling in Network: Application in Text Mining and Biology.

    In classical machine learning, hand-designed features are used for learning a mapping from raw data. However, human involvement in feature design makes the process expensive. Representation learning aims to learn abstract features directly from data without direct human involvement. Raw data can take various forms. A network is one form of data that encodes relational structure in many real-world domains. Therefore, learning abstract features for network units is an important task. In this dissertation, we propose models for incorporating temporal information given as a collection of networks from subsequent time-stamps. The primary objective of our models is to learn a better abstract feature representation of nodes and edges in an evolving network. We show that the temporal information in the abstract features substantially improves the performance of the link prediction task. Besides applying our models to network data, we also employ them to incorporate extra-sentential information in the text domain for learning better representations of sentences. We build a context network of sentences to capture extra-sentential information. This information in the abstract feature representation of sentences improves various text-mining tasks substantially over a set of baseline methods. A problem with the abstract features that we learn is that they lack interpretability. In real-life applications on network data, for some tasks, it is crucial to learn interpretable features in the form of graphical structures. For this we need to mine important graphical structures along with their frequency statistics from the input dataset. However, exact algorithms for these tasks are computationally expensive, so scalable algorithms are urgently needed. To overcome this challenge, we provide efficient sampling algorithms for mining higher-order structures from network(s). We show that our sampling-based algorithms are scalable. They are also superior to a set of baseline algorithms in terms of retrieving important graphical sub-structures and collecting their frequency statistics. Finally, we show that we can use these frequent subgraph statistics and structures as features in various real-life applications. We show one application in biology and another in security. In both cases, we show that the structures and their statistics significantly improve the performance of knowledge discovery tasks in these domains.

    Ab initio RNA folding

    RNA molecules are essential cellular machines performing a wide variety of functions for which a specific three-dimensional structure is required. Over the last several years, experimental determination of RNA structures through X-ray crystallography and NMR seems to have reached a plateau in the number of structures resolved each year, but as more and more RNA sequences are discovered, the need for structure prediction tools to complement experimental data is strong. Theoretical approaches to RNA folding have been developed since the late nineties, when the first algorithms for secondary structure prediction appeared. Over the last 10 years a number of prediction methods for 3D structures have been developed, first based on bioinformatics and data mining, and more recently based on a coarse-grained physical representation of the systems. In this review we present the challenges of RNA structure prediction and the main ideas behind bioinformatic and physics-based approaches. We focus on the description of the more recent physics-based phenomenological models and on how they are built to include the specificity of the interactions of RNA bases, whose role is critical in folding. Through examples from different models, we point out the strengths of physics-based approaches, which are able not only to predict equilibrium structures but also to investigate dynamical and thermodynamical behavior, and the open challenges of including more of the key interactions ruling RNA folding. Comment: 28 pages, 18 figures