
    Discover Protein Complexes in Protein-Protein Interaction Networks Using Parametric Local Modularity

    Background: Recent advances in proteomic technologies have enabled us to create detailed protein-protein interaction maps in multiple species and in both normal and diseased cells. As the size of the interaction dataset increases, powerful computational methods are required in order to effectively distil network models from large-scale interactome data. Results: We present an algorithm, miPALM (Module Inference by Parametric Local Modularity), to infer protein complexes in a protein-protein interaction network. The algorithm uses a novel graph theoretic measure, parametric local modularity, to identify highly connected sub-networks as candidate protein complexes. Using gold standard sets of protein complexes and protein function and localization annotations, we show that our algorithm achieves an overall improvement over previous algorithms in terms of precision, recall, and biological relevance of the predicted complexes. We applied our algorithm to predict and characterize a set of 138 novel protein complexes in S. cerevisiae. Conclusions: miPALM is a novel algorithm for detecting protein complexes from large protein-protein interaction networks with improved accuracy over previous methods. The software is implemented in Matlab and is freely available at http://www.medicine.uiowa.edu/Labs/tan/software.html.

    Increased entropy of signal transduction in the cancer metastasis phenotype

    Studies into the statistical properties of biological networks have led to important biological insights, such as the presence of hubs and hierarchical modularity. There is also a growing interest in studying the statistical properties of networks in the context of cancer genomics. However, relatively little is known as to what network features differ between the cancer and normal cell physiologies, or between different cancer cell phenotypes. Based on the observation that frequent genomic alterations underlie a more aggressive cancer phenotype, we asked if such an effect could be detectable as an increase in the randomness of local gene expression patterns. Using a breast cancer gene expression data set and a model network of protein interactions we derive constrained weighted networks defined by a stochastic information flux matrix reflecting expression correlations between interacting proteins. Based on this stochastic matrix we propose and compute an entropy measure that quantifies the degree of randomness in the local pattern of information flux around single genes. By comparing the local entropies in the non-metastatic versus metastatic breast cancer networks, we here show that breast cancers that metastasize are characterised by a small yet significant increase in the degree of randomness of local expression patterns. We validate this result in three additional breast cancer expression data sets and demonstrate that local entropy better characterises the metastatic phenotype than other non-entropy based measures. We show that increases in entropy can be used to identify genes and signalling pathways implicated in breast cancer metastasis. Further exploration of such integrated cancer expression and protein interaction networks will therefore be a fruitful endeavour. Comment: 5 figures, 2 Supplementary Figures and Table
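
    As a rough illustration of the local entropy idea described in this abstract, the Python sketch below turns the non-negative edge weights around each gene into a probability distribution and computes its Shannon entropy. The function name `local_entropies`, the input layout, and the division by log(degree) are assumptions made for illustration; the abstract does not state how the stochastic matrix is built from expression correlations or how the entropy is normalised.

```python
import math

def local_entropies(neighbors, weights):
    """Normalised Shannon entropy of the outgoing weight distribution per gene.

    neighbors: dict mapping gene -> list of neighbouring genes (hypothetical layout)
    weights:   dict mapping (gene, neighbour) -> non-negative weight, e.g. a
               rescaled expression correlation (the rescaling is an assumption)
    """
    entropies = {}
    for gene, nbrs in neighbors.items():
        total = sum(weights[(gene, n)] for n in nbrs)
        # Entropy is defined as 0 for genes with fewer than two neighbours
        # or with no outgoing weight at all.
        if len(nbrs) < 2 or total == 0:
            entropies[gene] = 0.0
            continue
        # Row-normalise so the weights around this gene sum to 1.
        probs = [weights[(gene, n)] / total for n in nbrs]
        h = -sum(p * math.log(p) for p in probs if p > 0)
        # Divide by the maximum possible entropy so values lie in [0, 1].
        entropies[gene] = h / math.log(len(nbrs))
    return entropies
```

    Under these assumptions, a gene whose weight is spread evenly over its neighbours scores close to 1, and a gene whose weight is concentrated on a single neighbour scores 0.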

    A survey of statistical network models

    Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, and a host of more specialized professional network communities has intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics. Comment: 96 pages, 14 figures, 333 references

    Cooperative co-evolutionary module identification with application to cancer disease module discovery

    Module identification or community detection in complex networks has become increasingly important in many scientific fields because it provides insight into the relationship and interaction between network function and topology. In recent years, module identification algorithms based on stochastic optimization algorithms such as evolutionary algorithms have been demonstrated to be superior to other algorithms on small- to medium-scale networks. However, the scalability and resolution limit (RL) problems of these module identification algorithms have not been fully addressed, which has impeded their application to real-world networks. This paper proposes a novel module identification algorithm called cooperative co-evolutionary module identification to address these two problems. The proposed algorithm employs a cooperative co-evolutionary framework to handle large-scale networks. We also incorporate a recursive partitioning scheme into the algorithm to effectively address the RL problem. The performance of our algorithm is evaluated on 12 benchmark complex networks. As a medical application, we apply our algorithm to identify disease modules that differentiate low- and high-grade glioma tumors to gain insights into the molecular mechanisms that underpin the progression of glioma. Experimental results show that the proposed algorithm has a very competitive performance compared with other state-of-the-art module identification algorithms.
    He, S and Jia, G and Zhu, Z and Tennant, DA and Huang, Q and Tang, K and Liu, J and Musolesi, M and Heath, JK and Yao, X

    Frequent Pattern Finding in Integrated Biological Networks

    Biomedical research is undergoing a revolution with the advance of high-throughput technologies. A major challenge in the post-genomic era is to understand how genes, proteins and small molecules are organized into signaling pathways and regulatory networks. To simplify the analysis of large complex molecular networks, strategies are sought to break them down into small yet relatively independent network modules, e.g. pathways and protein complexes. To find the evolutionary origins of network modules, a novel strategy has been developed to uncover duplicated pathways and protein complexes. This search was first formulated as a computational problem of finding frequent patterns in integrated graphs. The whole framework was then implemented as the software package BLUNT, which includes a parallelized version. To evaluate the biological significance of the work, several large datasets were chosen, with each dataset targeting a different biological question. An application of BLUNT to the yeast protein-protein interaction network is described. A large number of frequent patterns were discovered and predicted to be duplicated pathways. To explore how these pathways may have diverged since duplication, the differential regulation of duplicated pathways was studied at the transcriptional level, both in terms of time and location. As demonstrated, this algorithm can be used as a new data mining tool for large-scale biological data in general. It also provides a novel strategy to study the evolution of pathways and protein complexes in a systematic way. Understanding how pathways and protein complexes evolve will greatly benefit the fundamentals of biomedical research.

    Identifying protein complexes and disease genes from biomolecular networks

    With advances in high-throughput measurement techniques, large-scale biological data, such as protein-protein interaction (PPI) data, gene expression data, gene-disease association data, cellular pathway data, and so on, have been and will continue to be produced. Those data contain insightful information for understanding the mechanisms of biological systems and have proved useful for developing new methods in disease diagnosis, disease treatment and drug design. This study focuses on two main research topics: (1) identifying protein complexes and (2) identifying disease genes from biomolecular networks. Firstly, protein complexes are groups of proteins that interact with each other at the same time and place within living cells. They are molecular entities that carry out cellular processes. The identification of protein complexes plays a primary role in understanding the organization of proteins and the mechanisms of biological systems. Many previous algorithms are designed based on the assumption that protein complexes are densely connected sub-graphs in PPI networks. In this research, a dense sub-graph detection algorithm is first developed following this assumption by using clique seeds and graph entropy. Although the proposed algorithm generates a large number of reasonable predictions and its f-score is better than those of many previous algorithms, it still cannot identify many known protein complexes. After that, we analyze characteristics of known yeast protein complexes and find that not all of the complexes exhibit dense structures in PPI networks. Many of them have a star-like structure, which is a very special case of the core-attachment structure and cannot be identified by many previous core-attachment-structure-based algorithms. To increase the prediction accuracy of protein complex identification, a multiple-topological-structure-based algorithm is proposed to identify protein complexes from PPI networks. Four single-topological-structure-based algorithms are first employed to detect raw predictions with clique, dense, core-attachment and star-like structures, respectively. A merging and trimming step is then adopted to generate final predictions based on topological information or GO annotations of predictions. A comprehensive review of the identification of protein complexes, from static PPI networks to dynamic PPI networks, is also given in this study. Secondly, genetic diseases often involve the dysfunction of multiple genes. Various types of evidence have shown that similar disease genes tend to lie close to one another in various biomolecular networks. The identification of disease genes via multiple data integration is indispensable to understanding the genetic mechanisms of many genetic diseases. However, the number of known disease genes related to similar genetic diseases is often small. It is not easy to capture the intricate gene-disease associations from such a small number of known samples. Moreover, different kinds of biological data are heterogeneous and no widely accepted criterion is available to standardize them to the same scale. In this study, a flexible and reliable multiple data integration algorithm is first proposed to identify disease genes based on the theory of Markov random fields (MRF) and the method of Bayesian analysis. A novel global-characteristic-based parameter estimation method and an improved Gibbs sampling strategy are introduced, such that the proposed algorithm has the capability to tune parameters of different data sources automatically. However, the Markovianity characteristic of the proposed algorithm means it only considers information of direct neighbors to formulate the relationship among genes, ignoring the contribution of indirect neighbors in biomolecular networks. To overcome this drawback, a kernel-based MRF algorithm is further proposed to take advantage of the global characteristics of biological data via graph kernels. The kernel-based MRF algorithm generates predictions better than many previous disease gene identification algorithms in terms of the area under the receiver operating characteristic curve (AUC score). However, it is very time-consuming, since the Gibbs sampling process of the algorithm has to maintain a long Markov chain for every single gene. Finally, to reduce the computational time of the MRF-based algorithm, a fast, high-performance logistic-regression-based algorithm is developed for identifying disease genes from biomolecular networks. Numerical experiments show that the proposed algorithm outperforms many existing methods in terms of the AUC score and running time. To summarize, this study has developed several computational algorithms for identifying protein complexes and disease genes from biomolecular networks. These proposed algorithms perform better than many other existing algorithms in the literature.

    A temporal precedence based clustering method for gene expression microarray data

    Background: Time-course microarray experiments can produce useful data which can help in understanding the underlying dynamics of the system. Clustering is an important stage in microarray data analysis where the data is grouped together according to certain characteristics. The majority of clustering techniques are based on distance or visual similarity measures which may not be suitable for clustering of temporal microarray data where the sequential nature of time is important. We present a Granger causality based technique to cluster temporal microarray gene expression data, which measures the interdependence between two time-series by statistically testing whether one time-series can be used to forecast the other. Results: A gene-association matrix is constructed by testing temporal relationships between pairs of genes using the Granger causality test. The association matrix is further analyzed using a graph-theoretic technique to detect highly connected components representing interesting biological modules. We test our approach on synthesized datasets and real biological datasets obtained for Arabidopsis thaliana. We show the effectiveness of our approach by analyzing the results using the existing biological literature. We also report interesting structural properties of the association network commonly desired in any biological system. Conclusions: Our experiments on synthesized and real microarray datasets show that our approach produces encouraging results. The method is simple in implementation and is statistically traceable at each step. The method can produce sets of functionally related genes which can be further used for reverse-engineering of gene circuits.
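
    The pipeline outlined in this abstract (pairwise Granger tests, an association matrix, then a graph step) can be sketched in plain Python as follows. This is a minimal sketch under several assumptions that are not in the abstract: a single lag, a fixed F-statistic cutoff in place of a proper p-value, and ordinary connected components standing in for the unspecified "highly connected components" technique. The names `solve`, `rss`, `granger_f` and `modules` are hypothetical.

```python
def solve(a, b):
    """Solve a small non-singular system a.x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rss(rows, y):
    """Residual sum of squares of a least-squares fit of y on the given rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, y))

def granger_f(x, y):
    """F statistic for 'the previous value of x helps forecast y' (lag 1 only)."""
    target = y[1:]
    restricted = [[1.0, y[t]] for t in range(len(y) - 1)]    # y's own past
    full = [[1.0, y[t], x[t]] for t in range(len(y) - 1)]    # plus x's past
    rss_r, rss_u = rss(restricted, target), rss(full, target)
    if rss_u < 1e-12:              # essentially perfect fit with x included
        return float("inf")
    return (rss_r - rss_u) / (rss_u / (len(target) - 3))

def modules(genes, series, threshold=4.0):
    """Link genes whose F statistic exceeds an assumed cutoff, then group them."""
    adj = {g: set() for g in genes}
    for a in genes:
        for b in genes:
            if a != b and granger_f(series[a], series[b]) > threshold:
                adj[a].add(b)
                adj[b].add(a)      # direction is dropped for the grouping step
    seen, found = set(), []
    for g in genes:
        if g in seen:
            continue
        stack, comp = [g], set()
        while stack:               # depth-first walk of one component
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        found.append(comp)
    return found
```

    In practice a library routine with proper lag selection and p-values would likely replace `granger_f`; the sketch only shows how the test, the association matrix and the grouping step fit together.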

    Posterior Association Networks and Functional Modules Inferred from Rich Phenotypes of Gene Perturbations

    Combinatorial gene perturbations provide rich information for a systematic exploration of genetic interactions. Despite successful applications to bacteria and yeast, the scalability of this approach remains a major challenge for higher organisms such as humans. Here, we report a novel experimental and computational framework to efficiently address this challenge by limiting the ‘search space’ for important genetic interactions. We propose to integrate rich phenotypes of multiple single gene perturbations to robustly predict functional modules, which can subsequently be subjected to further experimental investigations such as combinatorial gene silencing. We present posterior association networks (PANs) to predict functional interactions between genes estimated using a Bayesian mixture modelling approach. The major advantage of this approach over conventional hypothesis tests is that prior knowledge can be incorporated to enhance predictive power. We demonstrate, in a simulation study and on biological data, that integrating complementary information greatly improves prediction accuracy. To search for significant modules, we perform hierarchical clustering with multiscale bootstrap resampling. We demonstrate the power of the proposed methodologies in applications to Ewing's sarcoma and human adult stem cells using publicly available and custom generated data, respectively. In the former application, we identify a gene module including many confirmed and highly promising therapeutic targets. Genes in the module are also significantly overrepresented in signalling pathways that are known to be critical for proliferation of Ewing's sarcoma cells. In the latter application, we predict a functional network of chromatin factors controlling epidermal stem cell fate. Further examinations using ChIP-seq, ChIP-qPCR and RT-qPCR reveal that the basis of their genetic interactions may arise from transcriptional cross regulation. A Bioconductor package implementing PAN is freely available online at http://bioconductor.org/packages/release/bioc/html/PANR.html