405 research outputs found
Phylogenetic correlations can suffice to infer protein partners from sequences
Determining which proteins interact together is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have made it possible to identify interaction partners among paralogous proteins from sequence data. This success of DCA at predicting protein-protein interactions could be mainly based on its known ability to identify pairs of residues that are in contact in the three-dimensional structure of protein complexes and that coevolve to remain physicochemically complementary. However, interacting proteins possess similar evolutionary histories. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involve phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that share evolutionary history. While phylogenetic correlations confound the identification of contacting residues by DCA, they are thus useful to predict interacting partners among paralogs. We find that DCA performs as well as phylogenetic methods to this end, and slightly better than them with large and accurate training sets. Employing DCA or phylogenetic methods within an Iterative Pairing Algorithm (IPA) makes it possible to predict pairs of evolutionary partners without a training set. We further demonstrate the ability of these various methods to correctly predict pairings among real paralogous proteins with genome proximity but no known direct physical interaction, illustrating the importance of phylogenetic correlations in natural data. However, for physically interacting and strongly coevolving proteins, DCA and mutual information outperform phylogenetic methods. We finally discuss how to distinguish physically interacting proteins from proteins that only share a common evolutionary history.
Inferring interaction partners from protein sequences using mutual information
Functional protein-protein interactions are crucial in most cellular
processes. They enable multi-protein complexes to assemble and to remain
stable, and they allow signal transduction in various pathways. Functional
interactions between proteins result in coevolution between the interacting
partners, and thus in correlations between their sequences. Pairwise
maximum-entropy based models have enabled successful inference of pairs of
amino-acid residues that are in contact in the three-dimensional structure of
multi-protein complexes, starting from the correlations in the sequence data of
known interaction partners. Recently, algorithms inspired by these methods have
been developed to identify which proteins are functional interaction partners
among the paralogous proteins of two families, starting from sequence data
alone. Here, we demonstrate that a slightly higher performance for partner
identification can be reached by an approximate maximization of the mutual
information between the sequence alignments of the two protein families. Our
mutual information-based method also provides signatures of the existence of
interactions between protein families. These results stand in contrast with
structure prediction of proteins and of multi-protein complexes from sequence
data, where pairwise maximum-entropy based global statistical models
substantially improve performance compared to mutual information. Our findings
entail that the statistical dependences allowing interaction partner prediction
from sequence data are not restricted to the residue pairs that are in direct
contact at the interface between the partner proteins. Comment: 26 pages, 11 figures, published version
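As a rough illustration of the quantity this abstract refers to, the sketch below computes a plug-in Shannon mutual information between alignment columns and sums it over all inter-protein column pairs for a given pairing of sequences. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names `column_mutual_information` and `pairing_score` are hypothetical, no pseudocounts or sequence reweighting are applied, and the approximate maximization over pairings mentioned in the abstract is not implemented.

```python
from collections import Counter
from math import log


def column_mutual_information(col_a, col_b):
    """Plug-in Shannon mutual information (in nats) between two alignment columns.

    col_a[k] and col_b[k] are the residues found in the k-th paired sequence.
    Illustrative only: no pseudocounts, no sequence reweighting.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p_ab * log(p_ab / (p_a * p_b)), written in terms of raw counts
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def pairing_score(seqs_a, seqs_b):
    """Sum of column-pair MI for the pairing seqs_a[k] <-> seqs_b[k].

    Hypothetical scoring function: every column of family A is compared
    with every column of family B.
    """
    cols_a = list(zip(*seqs_a))
    cols_b = list(zip(*seqs_b))
    return sum(
        column_mutual_information(ca, cb) for ca in cols_a for cb in cols_b
    )


if __name__ == "__main__":
    family_a = ["AC", "AC", "GT", "GT"]
    matched_b = ["K", "K", "R", "R"]    # covaries with family_a
    shuffled_b = ["K", "R", "K", "R"]   # pairing scrambled
    print(pairing_score(family_a, matched_b))
    print(pairing_score(family_a, shuffled_b))
```

In this toy example the matched pairing scores higher than the scrambled one, which is the kind of signal a pairing search would try to maximize.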
Transkingdom Networks: A Systems Biology Approach to Identify Causal Members of Host-Microbiota Interactions
Improvements in sequencing technologies and reduced experimental costs have
resulted in a vast number of studies generating high-throughput data. Although
the number of methods to analyze these "omics" data has also increased,
computational complexity and lack of documentation hinder researchers from
analyzing their high-throughput data to its true potential. In this chapter we
detail our data-driven, transkingdom network (TransNet) analysis protocol to
integrate and interrogate multi-omics data. This systems biology approach has
allowed us to successfully identify important causal relationships between
different taxonomic kingdoms (e.g. mammals and microbes) using diverse types of
data.
Information Theory in Molecular Evolution: From Models to Structures and Dynamics
This Special Issue collects novel contributions from scientists in the interdisciplinary field of biomolecular evolution. Works listed here use information theoretical concepts as a core but are tightly integrated with the study of molecular processes. Applications include the analysis of phylogenetic signals to elucidate biomolecular structure and function, the study and quantification of structural dynamics and allostery, as well as models of molecular interaction specificity inspired by evolutionary cues.
Protein 3D Structure Computed from Evolutionary Sequence Variation
The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing.
Unsupervised inference methods for protein sequence data
The abstract is in the attachment.
Assessing the utility of mutual information stored in protein-protein interfaces to infer specific protein partners
Thesis (doctorate), Universidade de Brasília, Instituto de Ciências Biológicas, Departamento de Biologia Celular, Programa de Pós-Graduação em Biologia Molecular, 2021. Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
Proteins are essential for several cellular processes. Hence, one of the central objectives in Biology is to
understand the relationships between sequence, structure and function of these macromolecules. In this
context, marks left by the coevolutionary process in interacting protein sequences are an important source of
structural information. In fact, statistical correlations between amino acid sites in protein sequences are at
the basis of state-of-the-art methods for prediction of inter- and intra-protein contacts, template-free structure
prediction, identification of functional sites and specificity determining residues, inference of interacting
paralogs, among other applications. In line with that, the present work conveys a set of theoretical results on
how specific protein partners can be recovered based on sequence information alone. In the first chapter, a
decomposition of the mutual information (MI) present in protein-protein complexes is carried out, considering
the hypothesis that MI in proteins originates from a combination of coevolutionary, evolutionary and stochastic
sources. It was observed that the interface contains on average, per contact, more information than the rest of
the protein complex, a result that holds when considering both Shannon and Tsallis MI as a measure of
information. This observation led to the conclusion that the interface contains the strongest information signal
for distinguishing the correct set of protein partners in interacting protein families. Building on that, the utility
of using MI encoded on protein-protein interfaces to recover the correct set of protein partners is assessed in
the second chapter. A genetic algorithm (GA) was developed to explore the space of possible concatenations
between a pair of interacting protein families using the interface MI as objective function. Using the GA,
interface MI maximization was performed for 26 different pairs of interacting protein families and it was
observed that optimized concatenations corresponded to degenerate solutions with two distinct error
sources, arising from mismatches among (i) similar and (ii) non-similar sequences. When mistakes made
among similar sequences were disregarded, type-(i) solutions were found to resolve correct pairings at best
true positive (TP) rates of 70%, far above the corresponding estimates for type-(ii) solutions. These results hold
when the optimizations are based on Tsallis MI. These findings raise further questions about the
mechanisms behind protein partner coevolution and help rationalize literature data showing a sharp
deterioration of TP rates with increasing sequence number in MI-based approaches.
Assessing Microbial Diversity Through Nucleotide Variation
Microbes are the most abundant and most diverse form of life on Earth, constituting the largest portion of the total biomass of the entire planet. They are present in every niche in nature, including very extreme environments, and they govern biogeochemical transformations in ecosystems. The human body is home to a diverse assemblage of microbial species as well. In fact, the number of microbial cells in the gastrointestinal tract, oral cavity, skin, airway passages and urogenital system is approximately an order of magnitude greater than the number of cells that make up the human body itself, and changes in the composition and relative abundance of these microbial communities are highly associated with intestinal and respiratory disorders and diseases of the skin and mucous membranes. In the early 1990s, cultivation-independent methods, especially those based on PCR amplification and sequences of phylogenetically informative 16S rRNA genes, made it possible to assess the composition of microbial species in natural environments, and advances in high-throughput sequencing technologies in recent years have increased sequencing capacity and microbial detection by orders of magnitude. However, the effectiveness of current computational methods available to analyze the vast amounts of sequence data is poor, and investigating the diversity within microbial communities remains challenging. In addition to offering an easy-to-use visualization and statistical analysis framework for microbial community analyses, the study described herein aims to present a biologically relevant computational approach for assessing microbial diversity at finer scales of microbial communities through nucleotide variation in 16S rRNA genes.