3,171 research outputs found

    ProtNN: Fast and Accurate Nearest Neighbor Protein Function Prediction based on Graph Embedding in Structural and Topological Space

    Studying the function of proteins is important for understanding the molecular mechanisms of life. The number of publicly available protein structures has grown extremely large. Still, determining the function of a protein structure remains a difficult, costly, and time-consuming task. The difficulties are often due to the essential role of spatial and topological structures in the determination of protein functions in living cells. In this paper, we propose ProtNN, a novel approach for protein function prediction. Given an unannotated protein structure and a set of annotated proteins, ProtNN finds the nearest neighbor annotated structures based on protein-graph pairwise similarities. Given a query protein, ProtNN finds the nearest neighbor reference proteins based on a graph representation model and a pairwise similarity between the vector embeddings of the query and reference protein-graphs in structural and topological spaces. ProtNN assigns to the query protein the function with the highest number of votes across the set of k nearest neighbor reference proteins, where k is a user-defined parameter. Experimental evaluation demonstrates that ProtNN accurately classifies several datasets with an extremely fast runtime compared to state-of-the-art approaches. We further show that ProtNN scales up to a whole PDB dataset in single-process mode with no parallelization, with a runtime gain on the order of thousands compared to state-of-the-art approaches.
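
    The voting step described in this abstract can be sketched as a plain k-nearest-neighbor classifier over embedding vectors. This is a minimal illustration, not ProtNN's implementation: the Euclidean distance, the two-dimensional toy embeddings, the function labels, and the names `euclidean` and `knn_predict` are all assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, references, k=3):
    """Return the majority label among the k references closest to `query`.

    `references` is a list of (embedding, label) pairs. Ties in the vote are
    broken by whichever label was encountered first among the neighbors.
    """
    ranked = sorted(references, key=lambda ref: euclidean(query, ref[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical reference set: toy embeddings with made-up function labels.
refs = [
    ([0.0, 0.0], "kinase"),
    ([0.1, 0.2], "kinase"),
    ([0.2, 0.1], "kinase"),
    ([5.0, 5.0], "protease"),
    ([5.1, 4.9], "protease"),
]

print(knn_predict([0.05, 0.1], refs, k=3))  # the three closest are all "kinase"
```

    Here k plays the role of the user-defined parameter mentioned in the abstract; how the protein-graphs are embedded into vectors is outside the scope of this sketch.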

    A spatio-temporal mining approach towards summarizing and analyzing protein folding trajectories

    Understanding the protein folding mechanism remains a grand challenge in structural biology. In the past several years, computational theories in molecular dynamics have been employed to shed light on the folding process. Coupled with high computing power and large-scale storage, researchers can now computationally simulate the protein folding process in atomistic detail at femtosecond temporal resolution. Such simulations often produce a large number of folding trajectories, each consisting of a series of 3D conformations of the protein under study. As a result, effectively managing and analyzing such trajectories is becoming increasingly important. In this article, we present a spatio-temporal mining approach to analyze protein folding trajectories. It exploits the simplicity of contact maps, while also integrating 3D structural information in the analysis. It characterizes the dynamic folding process by first identifying spatio-temporal association patterns in contact maps, then studying how such patterns evolve along a folding trajectory. We demonstrate that such patterns can be leveraged to summarize folding trajectories, and to facilitate the detection and ordering of important folding events along a folding path. We also show that such patterns can be used to identify a consensus partial folding pathway across multiple folding trajectories. Furthermore, we argue that such patterns can capture both local and global structural topology in a 3D protein conformation, thereby facilitating effective structural comparison among conformations. We apply this approach to analyze the folding trajectories of two small synthetic proteins, BBA5 and GSGS (or Beta3S). We show that this approach is promising for addressing the above issues, namely folding trajectory summarization, folding event detection and ordering, and consensus partial folding pathway identification across trajectories.
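
    As a rough illustration of the contact-map representation this abstract builds on, the sketch below derives a contact map from one 3D conformation and reports which contacts form or break between two frames of a trajectory. It does not reproduce the paper's pattern-mining method. The 8 Å cutoff, the minimum sequence separation, the one-point-per-residue coordinates, and the names `contact_map` and `contact_changes` are assumptions for the example.

```python
import math


def contact_map(coords, cutoff=8.0, min_sep=2):
    """Return the set of residue index pairs (i, j), i < j, that are in contact.

    `coords` holds one 3D point per residue. Two residues are in contact when
    they are at least `min_sep` apart in sequence and within `cutoff` in space.
    Both thresholds are illustrative assumptions.
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_sep, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts


def contact_changes(prev, curr):
    """Contacts formed and contacts broken between two consecutive frames."""
    return curr - prev, prev - curr


# Two hypothetical frames of a four-residue chain: extended, then compact.
frame_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
frame_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

formed, broken = contact_changes(contact_map(frame_a), contact_map(frame_b))
print(sorted(formed), sorted(broken))  # contact (0, 3) forms; none break
```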

    Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity

    Background: Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the “twilight” or “midnight” zones, where pairwise sequence identities to known sequences fall below 25% and sequence-based functional annotations often fail. Results: Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biological motivation of our study is to recognize common structure patterns in “immunoevasins”, proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. Conclusions: We present a theoretical framework, offer a practical software implementation for incorporating prior domain knowledge, such as the substitution matrices studied here, and devise an efficient algorithm to identify approximately matched frequent subgraphs. By doing so, we significantly expand the analytical power of sophisticated data mining algorithms in dealing with large volumes of complicated and noisy protein structure data. Without loss of generality, the choice of appropriate compatibility matrices allows our method to be easily employed in domains where subgraph labels have some uncertainty.

    Towards an Efficient Discovery of Topological Representative Subgraphs

    Pattern selection based on exact or approximate structural similarity is one way to reduce the large number of frequent subgraphs. However, current structural similarity strategies are not effective in many real-world contexts. Moreover, the combinatorial nature of graphs makes exact or approximate isomorphism testing very costly. In this paper, we propose an approach that selects a subset of topological representative subgraphs from among the frequent ones. The proposed approach avoids the costly exact or approximate isomorphism test by measuring overall structural similarity based on a set of considered topological attributes. It also detects hidden structural similarities (such as density, diameter, etc.) that existing approaches do not consider. In addition, the proposed approach is flexible and can easily be extended with user-defined attributes depending on the application. Experimental analyses on real and synthetic graph databases show the effectiveness of our approach.
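
    The idea of comparing subgraphs through topological attributes rather than isomorphism tests can be sketched as follows. Each subgraph is reduced to a small attribute vector (here density and diameter, the two attributes the abstract names), and a subgraph is kept as a representative only if its vector is far enough from those already kept. The greedy filter, the `min_dist` threshold, and the function names are assumptions chosen for illustration; the paper's actual selection procedure may differ.

```python
import math
from collections import deque


def diameter(adj):
    """Longest shortest path in a connected, unweighted graph (BFS per node)."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def topo_vector(adj):
    """Describe an undirected graph (adjacency dict) as (density, diameter)."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return (density, float(diameter(adj)))


def select_representatives(graphs, min_dist=0.5):
    """Greedy filter: keep a graph only if its attribute vector is at least
    `min_dist` away from every representative kept so far (assumed rule)."""
    reps = []
    for name, adj in graphs.items():
        vec = topo_vector(adj)
        if all(math.dist(vec, rvec) >= min_dist for _, rvec in reps):
            reps.append((name, vec))
    return [name for name, _ in reps]


# Hypothetical frequent subgraphs: two 3-node paths with different node
# labels, and a triangle.
graphs = {
    "path3": {0: [1], 1: [0, 2], 2: [1]},
    "path3_relabeled": {"a": ["b"], "b": ["a", "c"], "c": ["b"]},
    "triangle": {0: [1, 2], 1: [0, 2], 2: [0, 1]},
}

print(select_representatives(graphs))  # the relabeled path is filtered out
```

    No isomorphism test is run at any point: the two paths are recognized as redundant purely because their attribute vectors coincide.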