241 research outputs found

    Methods for protein complex prediction and their contributions towards understanding the organization, function and dynamics of complexes

    Complexes of physically interacting proteins constitute fundamental functional units responsible for driving biological processes within cells. A faithful reconstruction of the entire set of complexes is therefore essential to understand the functional organization of cells. In this review, we discuss the key contributions of computational methods developed to date (approximately between 2003 and 2015) for identifying complexes from the network of interacting proteins (PPI network). We evaluate in depth the performance of these methods on PPI datasets from yeast, and highlight challenges faced by these methods, in particular the detection of sparse and small or sub-complexes and the discerning of overlapping complexes. We describe methods for integrating diverse information including expression profiles and 3D structures of proteins with PPI networks to understand the dynamics of complex formation, for instance, of time-based assembly of complex subunits and formation of fuzzy complexes from intrinsically disordered proteins. Finally, we discuss methods for identifying dysfunctional complexes in human diseases, an application that is proving invaluable to understand disease mechanisms and to discover novel therapeutic targets. We hope this review aptly commemorates a decade of research on computational prediction of complexes and constitutes a valuable reference for further advancements in this exciting area. Comment: 1 Tabl

    Identifying protein complexes and disease genes from biomolecular networks

    With advances in high-throughput measurement techniques, large-scale biological data, such as protein-protein interaction (PPI) data, gene expression data, gene-disease association data, cellular pathway data, and so on, have been and will continue to be produced. Those data contain insightful information for understanding the mechanisms of biological systems and have proved useful for developing new methods in disease diagnosis, disease treatment and drug design. This study focuses on two main research topics: (1) identifying protein complexes and (2) identifying disease genes from biomolecular networks. Firstly, protein complexes are groups of proteins that interact with each other at the same time and place within living cells. They are molecular entities that carry out cellular processes. The identification of protein complexes plays a primary role in understanding the organization of proteins and the mechanisms of biological systems. Many previous algorithms are designed based on the assumption that protein complexes are densely connected sub-graphs in PPI networks. In this research, a dense sub-graph detection algorithm is first developed following this assumption by using clique seeds and graph entropy. Although the proposed algorithm generates a large number of reasonable predictions and its f-score is better than that of many previous algorithms, it still cannot identify many known protein complexes. After that, we analyze characteristics of known yeast protein complexes and find that not all of the complexes exhibit dense structures in PPI networks. Many of them have a star-like structure, which is a very special case of the core-attachment structure and cannot be identified by many previous core-attachment-structure-based algorithms. To increase the prediction accuracy of protein complex identification, a multiple-topological-structure-based algorithm is proposed to identify protein complexes from PPI networks. 
Four single-topological-structure-based algorithms are first employed to detect raw predictions with clique, dense, core-attachment and star-like structures, respectively. A merging and trimming step is then adopted to generate final predictions based on topological information or GO annotations of predictions. A comprehensive review of the identification of protein complexes from static PPI networks to dynamic PPI networks is also given in this study. Secondly, genetic diseases often involve the dysfunction of multiple genes. Various types of evidence have shown that similar disease genes tend to lie close to one another in various biomolecular networks. The identification of disease genes via multiple data integration is indispensable to understanding the genetic mechanisms of many genetic diseases. However, the number of known disease genes related to similar genetic diseases is often small. It is not easy to capture the intricate gene-disease associations from such a small number of known samples. Moreover, different kinds of biological data are heterogeneous and no widely accepted criterion is available to standardize them to the same scale. In this study, a flexible and reliable multiple data integration algorithm is first proposed to identify disease genes based on the theory of Markov random fields (MRF) and the method of Bayesian analysis. A novel global-characteristic-based parameter estimation method and an improved Gibbs sampling strategy are introduced, such that the proposed algorithm has the capability to tune parameters of different data sources automatically. However, the Markovianity characteristic of the proposed algorithm means it only considers information of direct neighbors to formulate the relationship among genes, ignoring the contribution of indirect neighbors in biomolecular networks. 
To overcome this drawback, a kernel-based MRF algorithm is further proposed to take advantage of the global characteristics of biological data via graph kernels. The kernel-based MRF algorithm generates better predictions than many previous disease gene identification algorithms in terms of the area under the receiver operating characteristic curve (AUC score). However, it is very time-consuming, since the Gibbs sampling process of the algorithm has to maintain a long Markov chain for every single gene. Finally, to reduce the computational time of the MRF-based algorithm, a fast, high-performance logistic-regression-based algorithm is developed for identifying disease genes from biomolecular networks. Numerical experiments show that the proposed algorithm outperforms many existing methods in terms of the AUC score and running time. To summarize, this study has developed several computational algorithms for identifying protein complexes and disease genes from biomolecular networks. These proposed algorithms are better than many other existing algorithms in the literature.

    Iteration Method for Predicting Essential Proteins Based on Orthology and Protein-protein Interaction Networks

    Background: Identification of essential proteins plays a significant role in understanding minimal requirements for cellular survival and development. Many computational methods have been proposed for predicting essential proteins by using the topological features of protein-protein interaction (PPI) networks. However, most of these methods ignore the intrinsic biological meaning of proteins. Moreover, PPI data contain many false positives and false negatives. To overcome these limitations, many research groups have recently started to focus on identifying essential proteins by integrating PPI networks with other biological information. However, none of their methods has been widely acknowledged. Results: Considering that essential proteins are more evolutionarily conserved than nonessential proteins and that essential proteins frequently bind each other, we propose an iteration method, named ION, for predicting essential proteins by integrating orthology with PPI networks. Unlike other methods, ION identifies essential proteins depending not only on the connections between proteins but also on their orthologous properties and the features of their neighbors. ION is applied to predict essential proteins in S. cerevisiae. Experimental results show that ION can achieve higher identification accuracy than eight other existing centrality methods in terms of area under the curve (AUC). Moreover, ION identifies a large number of essential proteins which have been ignored by the eight other existing centrality methods because of their low connectivity. Many proteins ranked in the top 100 by ION are both essential and belong to complexes with certain biological functions. Furthermore, no matter how many reference organisms were selected, ION outperforms all eight other existing centrality methods, while using as many reference organisms as possible can improve the performance of ION. 
Additionally, ION shows good prediction performance in E. coli K-12. Conclusions: The accuracy of predicting essential proteins can be improved by integrating orthology with PPI networks.

    Mining Biological Networks towards Protein complex Detection and Gene-Disease Association

    Large amounts of biological data are continuously generated nowadays, thanks to the advancements of high-throughput experimental techniques. Mining valuable knowledge from such data still motivates the design of suitable computational methods, to complement the experimental work which is often bound by considerable time and cost requirements. Protein complexes, or groups of interacting proteins, are key players in most cellular events. The identification of complexes not only allows a better understanding of normal biological processes but also uncovers disease-triggering malfunctions. Ultimately, findings in this research branch can greatly enhance the design of effective medical treatments. The aim of this research is to detect protein complexes in protein-protein interaction networks and to associate the detected entities to diseases. The work is divided into three main objectives: first, develop a suitable method for the identification of protein complexes in static interaction networks; second, model the dynamic aspect of protein interaction networks and detect complexes accordingly; and third, design a learning model to link proteins, and subsequently protein complexes, to diseases. In response to these objectives, we present ProRank+, a novel complex-detection approach based on a ranking algorithm and a merging procedure. Then, we introduce DyCluster, which uses gene expression data to model the dynamics of the interaction networks, and we adapt the detection algorithm accordingly. Finally, we integrate network topology attributes and several biological features of proteins to form a classification model for gene-disease association. The reliability of the proposed methods is supported by various experimental studies conducted to compare them with existing approaches. ProRank+ detects more protein complexes than other state-of-the-art methods. DyCluster goes a step further and achieves a better performance than similar techniques. 
Then, our learning model shows that combining topological and biological features can greatly enhance the gene-disease association process. Finally, we present a comprehensive case study of breast cancer in which we pinpoint disease genes using our learning model; subsequently, we detect favorable groupings of those genes in a protein interaction network using the ProRank+ algorithm.

    Network-based approaches to explore complex biological systems towards network medicine

    Network medicine relies on different types of networks: from the molecular level of protein–protein interactions to gene regulatory network and correlation studies of gene expression. Among network approaches based on the analysis of the topological properties of protein–protein interaction (PPI) networks, we discuss the widespread DIAMOnD (disease module detection) algorithm. Starting from the assumption that PPI networks can be viewed as maps where diseases can be identified with localized perturbation within a specific neighborhood (i.e., disease modules), DIAMOnD performs a systematic analysis of the human PPI network to uncover new disease-associated genes by exploiting the connectivity significance instead of connection density. The past few years have witnessed the increasing interest in understanding the molecular mechanism of post-transcriptional regulation with a special emphasis on non-coding RNAs since they are emerging as key regulators of many cellular processes in both physiological and pathological states. Recent findings show that coding genes are not the only targets that microRNAs interact with. In fact, there is a pool of different RNAs—including long non-coding RNAs (lncRNAs) —competing with each other to attract microRNAs for interactions, thus acting as competing endogenous RNAs (ceRNAs). The framework of regulatory networks provides a powerful tool to gather new insights into ceRNA regulatory mechanisms. Here, we describe a data-driven model recently developed to explore the lncRNA-associated ceRNA activity in breast invasive carcinoma. On the other hand, a very promising example of the co-expression network is the one implemented by the software SWIM (switch miner), which combines topological properties of correlation networks with gene expression data in order to identify a small pool of genes—called switch genes—critically associated with drastic changes in cell phenotype. 
Here, we describe the SWIM tool along with its applications to cancer research and compare its predictions with DIAMOnD disease genes.
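To make the idea of "connectivity significance instead of connection density" concrete, the following is a minimal sketch, not the DIAMOnD implementation. It assumes that significance is scored with a hypergeometric tail probability: the chance that a node with a given degree has at least the observed number of links to seed (known disease) proteins if its neighbors were drawn at random. The function names `connectivity_pvalue` and `rank_candidates`, and the toy network, are hypothetical.

```python
from math import comb


def connectivity_pvalue(n_total, n_seeds, degree, links_to_seeds):
    """Hypergeometric tail P(X >= links_to_seeds).

    Assumed null model: the node's `degree` neighbors are drawn at random
    from `n_total` nodes, of which `n_seeds` are seed proteins.
    """
    upper = min(degree, n_seeds)
    total = comb(n_total, degree)
    tail = sum(
        comb(n_seeds, k) * comb(n_total - n_seeds, degree - k)
        for k in range(links_to_seeds, upper + 1)
    )
    return tail / total


def rank_candidates(adjacency, seeds):
    """Rank non-seed nodes by ascending p-value (most significant first)."""
    n_total = len(adjacency)
    scores = {}
    for node, neighbors in adjacency.items():
        if node in seeds:
            continue
        links = len(neighbors & seeds)
        scores[node] = connectivity_pvalue(
            n_total, len(seeds), len(neighbors), links
        )
    return sorted(scores, key=scores.get)


# Hypothetical toy network: C has few links, but all of them go to seeds;
# D is a hub with only one seed link.
adjacency = {
    "A": {"C", "D"},
    "B": {"C"},
    "C": {"A", "B"},
    "D": {"A", "E", "F", "G"},
    "E": {"D"},
    "F": {"D"},
    "G": {"D"},
}
seeds = {"A", "B"}
ranking = rank_candidates(adjacency, seeds)
```

In this toy case the low-degree node C ranks ahead of the hub D, because both of its two links reach seeds, whereas D's single seed link is unsurprising for a node of degree four.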

    Innovative Algorithms and Evaluation Methods for Biological Motif Finding

    Biological motifs are defined as overly recurring sub-patterns in biological systems. Sequence motifs and network motifs are examples of biological motifs. Due to the wide range of applications, many algorithms and computational tools have been developed for efficient search for biological motifs. As a result, there are more computationally derived motifs than experimentally validated motifs, and how to validate the biological significance of the ‘candidate motifs’ becomes an important question. Some sequence motifs are verified by their structural similarities or their functional roles in DNA or protein sequences, and stored in databases. However, the biological role of network motifs has yet to be validated and currently no databases exist for this purpose. In this thesis, we focus not only on the computational efficiency but also on the biological meanings of the motifs. We provide an efficient way to incorporate biological information with clustering analysis methods: for example, a sparse nonnegative matrix factorization (SNMF) method is used with Chou-Fasman parameters for protein motif finding. Biological network motifs are searched by various clustering algorithms with Gene Ontology (GO) information. Experimental results show that the algorithms perform better than existing algorithms by producing a larger number of high-quality biological motifs. In addition, we apply biological network motifs to the discovery of essential proteins. Essential proteins are defined as a minimum set of proteins which are vital for development to a fertile adult and in a cellular life in an organism. We design a new centrality algorithm with biological network motifs, named MCGO, and score proteins in a protein-protein interaction (PPI) network to find essential proteins. MCGO is also combined with other centrality measures to predict essential proteins using machine learning techniques. 
This thesis makes three contributions to the study of biological motifs: 1) clustering analysis is used efficiently and biological information is easily integrated with the analysis; 2) we focus more on the biological meanings of motifs by adding biological knowledge to the algorithms and by suggesting biologically related evaluation methods; 3) biological network motifs are successfully applied to a practical application, the prediction of essential proteins.

    A Special Structural Based Weighted Network Approach for the Analysis of Protein Complexes

    The detection and analysis of protein complexes is essential for understanding the functional mechanism and cellular integrity. Recently, several techniques for detecting and analysing protein complexes from Protein–Protein Interaction (PPI) datasets have been developed. Most of those techniques have shortcomings: inefficient detection of overlapping complexes, exclusion of attachment proteins from complex cores, inability to detect the inherent structures of underlying complexes, high false-positive rates, and poor enrichment analysis. To address these limitations, we introduce a special structural-based weighted network approach for the analysis of protein complexes based on Weighted Edge, Core-Attachment and Local Modularity structures (WECALM). Experimental results indicate that WECALM performs relatively better than existing algorithms in terms of accuracy, computational time, and p-value. A functional enrichment analysis also shows that WECALM is able to identify a large number of biologically significant protein complexes. Overall, WECALM outperforms other approaches by striking a better balance of accuracy and efficiency in the detection of protein complexes.

    Compact Integration of Multi-Network Topology for Functional Analysis of Genes

    The topological landscape of molecular or functional interaction networks provides a rich source of information for inferring functional patterns of genes or proteins. However, a pressing yet-unsolved challenge is how to combine multiple heterogeneous networks, each having different connectivity patterns, to achieve more accurate inference. Here, we describe the Mashup framework for scalable and robust network integration. In Mashup, the diffusion in each network is first analyzed to characterize the topological context of each node. Next, the high-dimensional topological patterns in individual networks are canonically represented using low-dimensional vectors, one per gene or protein. These vectors can then be plugged into off-the-shelf machine learning methods to derive functional insights about genes or proteins. We present tools based on Mashup that achieve state-of-the-art performance in three diverse functional inference tasks: protein function prediction, gene ontology reconstruction, and genetic interaction prediction. Mashup enables deeper insights into the structure of rapidly accumulating and diverse biological network data and can be broadly applied to other network science domains. Keywords: interactome analysis; network integration; heterogeneous networks; dimensionality reduction; network diffusion; gene function prediction; genetic interaction prediction; gene ontology reconstruction; drug response prediction. National Institutes of Health (U.S.) (Grant R01GM081871
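The diffusion step described in this abstract can be illustrated with a small sketch. It assumes the diffusion is a random walk with restart (RWR) and covers only that first step; the low-dimensional embedding that follows is not shown. The function name `random_walk_with_restart`, the restart probability of 0.5, and the three-node path network are illustrative assumptions, and every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(adjacency, start, restart=0.5, iterations=100):
    """Approximate the stationary distribution of a walk restarting at `start`.

    adjacency: dict mapping each node to a non-empty set of neighbors.
    Returns a dict of visiting probabilities, one per node; this vector is
    one way to describe the topological context of `start`.
    """
    nodes = sorted(adjacency)
    prob = {v: 0.0 for v in nodes}
    prob[start] = 1.0
    for _ in range(iterations):
        # With probability `restart` the walker jumps back to the start node.
        nxt = {v: (restart if v == start else 0.0) for v in nodes}
        # Otherwise it moves to a uniformly chosen neighbor.
        for u in nodes:
            share = (1.0 - restart) * prob[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share
        prob = nxt
    return prob


# Hypothetical three-node path network: a - b - c
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
profile = random_walk_with_restart(path, "a")
```

Running the walk once per node and per network yields one diffusion profile for each gene or protein in each network; these high-dimensional profiles are the kind of input that a dimensionality-reduction step would then compress into one vector per node.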

    Rechecking the Centrality-Lethality Rule in the Scope of Protein Subcellular Localization Interaction Networks

    Essential proteins are indispensable for living organisms to maintain life activities and play important roles in the studies of pathology, synthetic biology, and drug design. Therefore, besides experimental methods, many computational methods have been proposed to identify essential proteins. Based on the centrality-lethality rule, various centrality methods are employed to predict essential proteins in a Protein-protein Interaction Network (PIN). However, because they neglect the temporal and spatial features of protein-protein interactions, the centrality scores calculated by centrality methods are not effective enough for measuring the essentiality of proteins in a PIN. Moreover, many methods, which overfit to the features of essential proteins for one species, may perform poorly for other species. In this paper, we demonstrate that the centrality-lethality rule also exists in Protein Subcellular Localization Interaction Networks (PSLINs). To do this, a method based on Localization Specificity for Essential protein Detection (LSED) is proposed, which can be combined with any centrality method to calculate improved centrality scores by taking into consideration the PSLINs in which proteins play their roles. In this study, LSED was combined with eight centrality methods separately to calculate Localization-specific Centrality Scores (LCSs) for proteins based on the PSLINs of four species (Saccharomyces cerevisiae, Homo sapiens, Mus musculus and Drosophila melanogaster). Compared to the proteins with high centrality scores measured from the global PINs, more proteins with high LCSs measured from PSLINs are essential. This indicates that proteins with high LCSs measured from PSLINs are more likely to be essential and that the performance of centrality methods can be improved by LSED. Furthermore, LSED provides a widely applicable prediction model to identify essential proteins for different species.
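The abstract does not give LSED's scoring formula, so the following is only a rough sketch of the general idea of computing a centrality score inside localization-specific subnetworks rather than in the global network. It assumes degree as the centrality measure and takes the maximum over compartments as the final score; both choices, the function name `localization_specific_degree`, and the toy data are hypothetical.

```python
def localization_specific_degree(edges, localization):
    """Degree centrality computed per subcellular compartment.

    edges: iterable of (protein, protein) interaction pairs.
    localization: dict mapping each protein to a set of compartments.
    Returns, for each protein, its highest degree over the subnetworks
    induced by the compartments it is annotated to (an assumed aggregation).
    """
    scores = {}
    compartments = set().union(*localization.values())
    for compartment in compartments:
        members = {p for p, comps in localization.items() if compartment in comps}
        degree = {p: 0 for p in members}
        # Keep only interactions whose two partners share this compartment.
        for u, v in edges:
            if u in members and v in members:
                degree[u] += 1
                degree[v] += 1
        for protein, d in degree.items():
            scores[protein] = max(scores.get(protein, 0), d)
    return scores


# Hypothetical toy data: A has global degree 3, but only two of its
# interactions can occur within a single compartment.
edges = [("A", "B"), ("A", "C"), ("A", "D")]
localization = {
    "A": {"nucleus", "cytosol"},
    "B": {"nucleus"},
    "C": {"nucleus"},
    "D": {"cytosol"},
}
scores = localization_specific_degree(edges, localization)
```

The same pattern would apply to any other centrality measure: restrict the network to proteins sharing a compartment, score within that subnetwork, then combine the per-compartment scores.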