
Nature Precedings

Simultaneous identification of specifically interacting paralogs and inter-protein contacts by Direct-Coupling Analysis

Author: Baldassi Carlo
Gueudré Thomas
Pagnani Andrea
Weigt Martin
Zamparo Marco
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 01/01/2016

Understanding protein-protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein-protein interactions are urgently needed. Such methods should connect multiple scales, from evolutionarily conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue-residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has in turn been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being co-localized in operons. Here we show that the Direct-Coupling Analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify inter-protein residue-residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far. Comment: Main Text 19 pages, Supp. Inf. 16 pages

Archivio istituzionale della Ricerca - Bocconi

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

BClass: A Bayesian Approach Based on Mixture Models for Clustering and Classification of Heterogeneous Biological Data

Author: Arturo Medrano-Soto
J. Andres Christen
Julio Collado-Vides

Based on mixture models, we present a Bayesian method (called BClass) to classify biological entities (e.g. genes) when variables of quite heterogeneous nature are analyzed. Various statistical distributions are used to model the continuous/categorical data commonly produced by genetic experiments and large-scale genomic projects. We calculate the posterior probability of each entry to belong to each element (group) in the mixture. In this way, an original set of heterogeneous variables is transformed into a set of purely homogeneous characteristics represented by the probabilities of each entry to belong to the groups. The number of groups in the analysis is controlled dynamically by rendering the groups as 'alive' and 'dormant' depending upon the number of entities classified within them. Using standard Metropolis-Hastings and Gibbs sampling algorithms, we constructed a sampler to approximate posterior moments and grouping probabilities. Since this method does not require the definition of similarity measures, it is especially suitable for data mining and knowledge discovery in biological databases. We applied BClass to classify genes in RegulonDB, a database specialized in information about the transcriptional regulation of gene expression in the bacterium Escherichia coli. The classification obtained is consistent with current knowledge and allowed prediction of missing values for a number of genes. BClass is object-oriented and fully programmed in Lisp-Stat. The output grouping probabilities are analyzed and interpreted using graphical (dynamically linked plots) and query-based approaches. We discuss the advantages of using Lisp-Stat as a programming language as well as the problems we faced when the data volume increased exponentially due to the ever-growing number of genomic projects.
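
The posterior membership step described in the abstract above can be illustrated with a minimal sketch. It assumes a one-dimensional mixture of Gaussian components with known parameters. BClass itself models heterogeneous variables with various distributions and approximates the posterior by sampling, so the function name, the Gaussian choice, and the parameter values below are hypothetical and not taken from the paper.

```python
import math

def membership_posteriors(x, weights, means, sds):
    """Posterior probability that observation x belongs to each mixture group.

    Assumes Gaussian components with known parameters (an illustrative
    simplification, not the BClass model).
    """
    # Unnormalized posterior: prior weight times Gaussian likelihood.
    joint = [
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, sds)
    ]
    total = sum(joint)
    # Bayes' rule: normalize so the probabilities sum to one.
    return [j / total for j in joint]

# Hypothetical two-group mixture with equal prior weights.
posteriors = membership_posteriors(
    0.2, weights=[0.5, 0.5], means=[0.0, 3.0], sds=[1.0, 1.0]
)
```

Each entry is thereby mapped to a vector of group probabilities, which is the sense in which heterogeneous variables become homogeneous characteristics.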

Research Papers in Economics

Global Functional Atlas of Escherichia coli Encompassing Previously Uncharacterized Proteins

Author: Ali Mehrab
Babu Mohan
Butland Gareth
Chandran Shamanta
Christopolous Constantine
Emili Andrew
Eroukova Veronika
Golshani Ashkan
Greenblatt Jack F.
Guao Xinghua
Hu Pingzhao
Janga Sarah Chandra
Moreno-Hagelsieb Gabriel
Musso Gabriela
Nazarians-Armavil Anaies
Nazemof Nazila
Paccanaro Alberto
Phanse Sadhna
Pogoutse Oxana
Wong Peter
Yang Wenhong
Publication venue: Scholars Commons @ Laurier
Publication date: 01/04/2009

One-third of the 4,225 protein-coding genes of Escherichia coli K-12 remain functionally unannotated (orphans). Many map to distant clades such as Archaea, suggesting involvement in basic prokaryotic traits, whereas others appear restricted to E. coli, including pathogenic strains. To elucidate the orphans’ biological roles, we performed an extensive proteomic survey using affinity-tagged E. coli strains and generated comprehensive genomic context inferences to derive a high-confidence compendium for virtually the entire proteome consisting of 5,993 putative physical interactions and 74,776 putative functional associations, most of which are novel. Clustering of the respective probabilistic networks revealed putative orphan membership in discrete multiprotein complexes and functional modules together with annotated gene products, whereas a machine-learning strategy based on network integration implicated the orphans in specific biological processes. We provide additional experimental evidence supporting orphan participation in protein synthesis, amino acid metabolism, biofilm formation, motility, and assembly of the bacterial cell envelope. This resource provides a “systems-wide” functional blueprint of a model microbe, with insights into the biological and evolutionary significance of previously uncharacterized proteins.

Wilfrid Laurier University

Inter-protein sequence co-evolution predicts known physical interactions in bacterial ribosomes and the trp operon

Author: Feinauer Christoph
Pagnani Andrea
Szurmant Hendrik
Weigt Martin
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2016

Interaction between proteins is a fundamental mechanism that underlies virtually all biological processes. Many important interactions are conserved across a large variety of species. The need to maintain interaction leads to a high degree of co-evolution between residues in the interface between partner proteins. The inference of protein-protein interaction networks from the rapidly growing sequence databases is one of the most formidable tasks in systems biology today. We propose here a novel approach based on the Direct-Coupling Analysis of the co-evolution between inter-protein residue pairs. We use ribosomal and trp operon proteins as test cases: for the small and large ribosomal subunits, our approach predicts protein-interaction partners at true-positive rates of 70% and 90%, respectively, within the first 10 predictions, with areas of 0.69 and 0.81, respectively, under the ROC curves for all predictions. In the trp operon, it assigns the two largest interaction scores to the only two interactions experimentally known. On the level of residue interactions, we show that for the small and the large ribosomal subunit our approach predicts interacting residues in the system with true-positive rates of 60% and 85%, respectively, in the first 20 predictions. We use artificial data to show that the performance of our approach depends crucially on the size of the joint multiple sequence alignments, and we analyze how many sequences would be necessary for a perfect prediction if the sequences were sampled from the same model that we use for prediction. Given the performance of our approach on the test data, we speculate that it can be used to detect new interactions, especially in light of the rapid growth of available sequence data.
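
The "true-positive rate within the first 10 predictions" quoted in the abstract above can be read as the fraction of the highest-scoring predictions that are known interactions. The sketch below illustrates that reading only; the function name and the toy scores and labels are hypothetical and do not come from the paper.

```python
def top_k_true_positive_rate(scores, is_true, k):
    """Fraction of the k highest-scoring predictions that are true interactions."""
    if k <= 0 or not scores:
        raise ValueError("need a positive k and at least one prediction")
    # Rank candidate pairs by decreasing interaction score.
    ranked = sorted(zip(scores, is_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(1 for _, label in top if label) / len(top)

# Hypothetical interaction scores and ground-truth labels for five protein pairs.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [True, False, True, False, False]
rate = top_k_true_positive_rate(scores, labels, k=2)
```

With these toy values, one of the two top-ranked pairs is a true interaction, so the rate is 0.5.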

Archivio istituzionale della Ricerca - Bocconi

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

FigShare

PORTO Publications Open Repository TOrino

Reconstructing genome-wide regulatory network of E. coli using transcriptome data and predicted transcription factor activities

Author: Dickerson Julie A
Fu Yao
Jarboe Laura R
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Gene regulatory networks play essential roles in living organisms to control growth, keep internal metabolism running and respond to external environmental changes. Understanding the connections and the activity levels of regulators is important for the research of gene regulatory networks. While relevance score based algorithms that reconstruct gene regulatory networks from transcriptome data can infer genome-wide gene regulatory networks, they are unfortunately prone to false positive results. Transcription factor activities (TFAs) quantitatively reflect the ability of the transcription factor to regulate target genes. However, classic relevance score based gene regulatory network reconstruction algorithms use models that do not include the TFA layer, thus missing a key regulatory element. Results: This work integrates TFA prediction algorithms with relevance score based network reconstruction algorithms to reconstruct gene regulatory networks with improved accuracy over classic relevance score based algorithms. This method is called Gene expression and Transcription factor activity based Relevance Network (GTRNetwork). Different combinations of TFA prediction algorithms and relevance score functions have been applied to find the most efficient combination. When the integrated GTRNetwork method was applied to E. coli data, the reconstructed genome-wide gene regulatory network predicted 381 new regulatory links. This reconstructed gene regulatory network, including the predicted new regulatory links, shows promising biological significance. Many of the new links are verified by known TF binding site information, and many other links can be verified from the literature and databases such as EcoCyc. The reconstructed gene regulatory network is applied to a recent transcriptome analysis of E. coli during isobutanol stress. In addition to the 16 significantly changed TFAs detected in the original paper, another 7 significantly changed TFAs have been detected by using our reconstructed network. Conclusions: The GTRNetwork algorithm introduces the hidden layer TFA into classic relevance score-based gene regulatory network reconstruction processes. Integrating the TFA biological information with regulatory network reconstruction algorithms both significantly improves detection of new links and reduces the rate of false positives. The application of GTRNetwork to E. coli gene transcriptome data gives a set of potential regulatory links with promising biological significance for isobutanol stress and other conditions.

Digital Repository @ Iowa State University (ISU)

Crossref

Springer - Publisher Connector

Dissecting the Specificity of Protein-Protein Interaction in Bacterial Two-Component Signaling: Orphans and Crosstalks

Author: Hwa Terence
Lunt Bryan
Procaccini Andrea
Rattray Magnus
Szurmant Hendrik
Weigt Martin
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 09/05/2011

Predictive understanding of the myriad signal transduction pathways in a cell is an outstanding challenge of systems biology. Such pathways are primarily mediated by specific but transient protein-protein interactions, which are difficult to study experimentally. In this study, we dissect the specificity of protein-protein interactions governing two-component signaling (TCS) systems ubiquitously used in bacteria. Exploiting the large number of sequenced bacterial genomes and an operon structure which packages many pairs of interacting TCS proteins together, we developed a computational approach to extract a molecular interaction code capturing the preferences of a small but critical number of directly interacting residue pairs. This code is found to reflect physical interaction mechanisms, with the strongest signal coming from charged amino acids. It is used to predict the specificity of TCS interaction: our results compare favorably to most available experimental results, including the prediction of 7 (out of 8 known) interaction partners of orphan signaling proteins in Caulobacter crescentus. Surveying the available bacterial genomes, our results suggest that 15 to 25% of the TCS proteins could participate in out-of-operon "crosstalks". Additionally, we predict clusters of crosstalking candidates, expanding from the anecdotally known examples in model organisms. The tools and results presented here can be used to guide experimental studies towards a system-level understanding of two-component signaling. Comment: Supplementary information available on http://www.plosone.org/article/info:doi/10.1371/journal.pone.001972

Public Library of Science (PLOS)

Springer - Publisher Connector

Operon Prediction with Bayesian Classifiers

Author: Khuri Natalia
Publication venue: SJSU ScholarWorks
Publication date: 01/01/2007

In this work, we present an approach to predicting transcription units based on Bayesian classifiers. The predictor uses publicly available data to train the classifier, such as genome sequence data from GenBank, expression values from microarray experiments, and a collection of experimentally verified transcription units. We have studied the importance of each data source for the performance of the predictor by developing three classifier models and evaluating their outcomes. The predictor was trained and validated on the E. coli genome, but can be extended to other organisms. Using the full Bayesian classifier, we were able to correctly identify 80% of gene pairs belonging to operons.

SJSU ScholarWorks

Predicting protein linkages in bacteria: Which method is best depends on task

Author: Anis Karimpour-Fard
Lawrence E Hunter
Ryan T Gill
Sonia M Leach
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Applications of computational methods for predicting protein functional linkages are increasing. In recent years, several bacteria-specific methods for predicting linkages have been developed. The four major genomic context methods are Gene cluster, Gene neighbor, Rosetta Stone, and Phylogenetic profiles. These methods have been shown to be powerful tools, and this paper provides guidelines for when each method is appropriate by exploring different features of each method and potential improvements offered by their combination. We also review many previous treatments of these prediction methods, use the latest available annotations, and offer a number of new observations. Results: Using Escherichia coli K12 and Bacillus subtilis, linkage predictions made by each of these methods were evaluated against three benchmarks: functional categories defined by COG and KEGG, known pathways listed in EcoCyc, and known operons listed in RegulonDB. Each evaluated method had strengths and weaknesses, with no one method dominating all aspects of predictive ability studied. For functional categories, as previous studies have shown, the Rosetta Stone method was individually best at detecting linkages and predicting functions among proteins with shared KEGG categories, while the Phylogenetic profile method was best for linkage detection and function prediction among proteins with common COG functions. Differences in performance under COG versus KEGG may be attributable to the presence of paralogs. Better function prediction was observed when using a weighted combination of linkages based on reliability than when using a simple unweighted union of the linkage sets. For pathway reconstruction, 99 complete metabolic pathways in E. coli K12 (out of the 209 known, non-trivial pathways) and 193 pathways with 50% of their proteins were covered by linkages from at least one method. Gene neighbor was most effective individually on pathway reconstruction, with 48 complete pathways reconstructed. For operon prediction, Gene cluster predicted completely 59% of the known operons in E. coli K12 and 88% (333/418) in B. subtilis. Comparing two versions of the E. coli K12 operon database, many of the unannotated predictions in the earlier version were updated to true predictions in the later version. Using only linkages found by both Gene cluster and Gene neighbor improved the precision of operon predictions. Additionally, as previous studies have shown, combining features based on intergenic region and protein function improved the specificity of operon prediction. Conclusion: A common problem for computational methods is the generation of a large number of false positives that might be caused by an incomplete source of validation. By comparing two versions of a database, we demonstrated the dramatic differences in reported results. We used several benchmarks on which we have shown the comparative effectiveness of each prediction method, and provided guidelines as to which method is most appropriate for a given prediction task.

Crossref

Overlapping stochastic block models with application to the French political blogosphere

Author: Ambroise Christophe
Birmelé Etienne
Latouche Pierre
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2011

Complex systems in nature and in society are often represented as networks, describing the rich set of interactions between objects of interest. Many deterministic and probabilistic clustering methods have been developed to analyze such structures. Given a network, almost all of them partition the vertices into disjoint clusters, according to their connection profile. However, recent studies have shown that these techniques are too restrictive and that most existing networks contain overlapping clusters. To tackle this issue, we present in this paper the Overlapping Stochastic Block Model. Our approach allows the vertices to belong to multiple clusters and, to some extent, generalizes the well-known Stochastic Block Model [Nowicki and Snijders (2001)]. We show that the model is generically identifiable within classes of equivalence and we propose an approximate inference procedure, based on global and local variational techniques. Using toy data sets as well as the French Political Blogosphere network and the transcriptional network of Saccharomyces cerevisiae, we compare our work with other approaches. Comment: Published at http://dx.doi.org/10.1214/10-AOAS382 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org