75 research outputs found

    Evolutionary constraints on the complexity of genetic regulatory networks allow predictions of the total number of genetic interactions

    Genetic regulatory networks (GRNs) have been widely studied, yet the final size and properties of these networks remain poorly understood, mainly because no network is currently complete. In this study, we analyzed the distribution of GRN structural properties across a large set of distinct prokaryotic organisms and found a set of constrained characteristics, such as network density and number of regulators. Our results allowed us to estimate the number of interactions that complete networks would have, a valuable insight that could aid in the daunting task of network curation, prediction, and validation. Using state-of-the-art statistical approaches, we also provided new evidence to settle a previously stated controversy, which raised the possibility that complete biological networks are random and that the observed scale-free properties are an artifact of the sampling process during network discovery. Furthermore, we identified a set of properties that enabled us to assess the consistency of the connectivity distribution of various GRNs against different alternative statistical distributions. Our results favor the hypothesis that highly connected nodes (hubs) are not a consequence of network incompleteness. Finally, an interaction coverage computed for the GRNs as a proxy for completeness revealed that high-throughput-based reconstructions of GRNs could yield biased networks with a low average clustering coefficient, showing that classical targeted discovery of interactions is still needed. Comment: 28 pages, 5 figures, 12 pages supplementary information
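    Two of the structural properties named in the abstract above, network density and average clustering coefficient, can be illustrated with a minimal sketch. This is not the authors' code: the node names are hypothetical, density is assumed to be the directed definition without self-loops, and clustering is assumed to be computed on the undirected version of the network.

```python
from itertools import combinations


def directed_density(nodes, edges):
    """Fraction of possible directed edges that are present (self-loops ignored)."""
    n = len(nodes)
    if n < 2:
        return 0.0
    distinct = {(u, v) for u, v in edges if u != v}
    return len(distinct) / (n * (n - 1))


def average_clustering(nodes, edges):
    """Mean local clustering coefficient, treating every edge as undirected."""
    neighbours = {v: set() for v in nodes}
    for u, v in edges:
        if u != v:
            neighbours[u].add(v)
            neighbours[v].add(u)
    total = 0.0
    for v in nodes:
        k = len(neighbours[v])
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for a, b in combinations(neighbours[v], 2) if b in neighbours[a])
        total += 2 * links / (k * (k - 1))
    return total / len(nodes) if nodes else 0.0


# Hypothetical toy network: three mutually linked regulators and one target gene.
toy_nodes = ["tfA", "tfB", "tfC", "g1"]
toy_edges = [("tfA", "tfB"), ("tfA", "tfC"), ("tfB", "tfC"), ("tfA", "g1")]
toy_density = directed_density(toy_nodes, toy_edges)
toy_clustering = average_clustering(toy_nodes, toy_edges)
```

    A network reconstructed mostly from regulator-to-target edges, with few links among the targets of the same regulator, would score low on the second measure, which is the kind of bias the abstract attributes to high-throughput reconstructions.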

    A unified resource for transcriptional regulation in Escherichia coli K-12 incorporating high-throughput-generated binding data into RegulonDB version 10.0

    Background: Our understanding of the regulation of gene expression has benefited from the availability of high-throughput technologies that interrogate the whole genome for the binding of specific transcription factors and gene expression profiles. In the case of widely used model organisms, such as Escherichia coli K-12, the new knowledge gained from these approaches needs to be integrated with the legacy of accumulated knowledge from genetic and molecular biology experiments conducted in the pre-genomic era in order to attain the deepest level of understanding possible based on the available data. Results: In this paper, we describe an expansion of RegulonDB, the database containing the rich legacy of decades of classic molecular biology experiments supporting what we know about gene regulation and operon organization in E. coli K-12, to include the genome-wide dataset collections from 32 ChIP and 19 gSELEX publications, in addition to around 60 genome-wide expression profiles relevant to the functional significance of these datasets and used in their curation. Three essential features for the integration of this information coming from different methodological approaches are: first, a controlled vocabulary within an ontology for precisely defining growth conditions; second, the criteria to separate elements with enough evidence to consider them involved in gene regulation from isolated transcription factor binding sites without such support; and third, an expanded computational model supporting this knowledge. Altogether, this constitutes the basis for adequately gathering and enabling the comparisons and integration needed to manage and access such wealth of knowledge. Conclusions: This version 10.0 of RegulonDB is a first step toward what should become the unifying access point for current and future knowledge on gene regulation in E. coli K-12. Furthermore, this model platform and associated methodologies and criteria can be emulated for gathering knowledge on other microbial organisms.

    Network-based analysis of gene expression data

    The methods of molecular biology for the quantitative measurement of gene expression have undergone rapid development in the past two decades. High-throughput assays with microarray and RNA-seq technology now enable whole-genome studies in which several thousand genes can be measured at a time. However, this has also imposed serious challenges on data storage and analysis, which are the subject of the young but rapidly developing field of computational biology. Explaining observations made on such a large scale requires suitable and accordingly scaled models of gene regulation. Detailed models, as available for single genes, need to be extended and assembled into larger networks of regulatory interactions between genes and gene products. Incorporating such networks into methods for data analysis is crucial to identify the molecular mechanisms that drive the observed expression. As methods for this purpose emerge in parallel to each other and without a known standard of truth, results need to be critically checked in a competitive setup and in the context of the rich available literature corpus. This work is centered on and contributes to the following subjects, each of which represents an important and distinct research topic in the field of computational biology: (i) construction of realistic gene regulatory network models; (ii) detection of subnetworks that are significantly altered in the data under investigation; and (iii) systematic biological interpretation of detected subnetworks. For the construction of regulatory networks, I review existing methods with a focus on curation and inference approaches. I first describe how literature curation can be used to construct a regulatory network for a specific process, using the well-studied diauxic shift in yeast as an example. In particular, I address the question of how a detailed understanding, as available for the regulation of single genes, can be scaled up to the level of larger systems.
    I subsequently inspect methods for large-scale network inference, showing that they are significantly skewed towards master regulators. A recalibration strategy is introduced and applied, yielding an improved genome-wide regulatory network for yeast. To detect significantly altered subnetworks, I introduce GGEA as a method for network-based enrichment analysis. The key idea is to score regulatory interactions within functional gene sets for consistency with the observed expression. Compared to other recently published methods, GGEA yields results that consistently and coherently align expression changes with known regulation types and that are thus easier to explain. I also suggest and discuss several significant enhancements to the original method that improve its applicability, outcome, and runtime. For the systematic detection and interpretation of subnetworks, I have developed the EnrichmentBrowser software package. It implements several state-of-the-art methods besides GGEA and allows results to be combined and explored across methods. As part of the Bioconductor repository, the package provides unified access to the different methods and thus greatly simplifies their use for biologists. Extensions to this framework that support the automation of biological interpretation routines are also presented. In conclusion, this work contributes substantially to the research field of network-based analysis of gene expression data with respect to regulatory network construction, subnetwork detection, and their biological interpretation. This also includes recent developments as well as areas of ongoing research, which are discussed in the context of current and future questions arising from the new generation of genomic data.

    A Semi-Supervised Method for Predicting Transcription Factor–Gene Interactions in Escherichia coli

    While Escherichia coli has one of the most comprehensive datasets of experimentally verified transcriptional regulatory interactions of any organism, it is still far from complete. This presents a problem when trying to combine gene expression and regulatory interactions to model transcriptional regulatory networks. Using the available regulatory interactions to predict new interactions may lead to better coverage and more accurate models. Here, we develop SEREND (SEmi-supervised REgulatory Network Discoverer), a semi-supervised learning method that uses a curated database of verified transcription factor–gene interactions, DNA sequence binding motifs, and a compendium of gene expression data to make thousands of new predictions about transcription factor–gene interactions, including whether the transcription factor activates or represses the gene. Using genome-wide binding datasets for several transcription factors, we demonstrate that our semi-supervised classification strategy improves the prediction of targets for a given transcription factor. To further demonstrate the utility of our inferred interactions, we generated a new microarray gene expression dataset for the aerobic to anaerobic shift response in E. coli. We used our inferred interactions together with the verified interactions to reconstruct a dynamic regulatory network for this response. The network reconstructed using our inferred interactions was better able to correctly identify known regulators and suggested additional activators and repressors as having important roles during the aerobic–anaerobic shift interface.

    Blueprint: describing the complexity of metabolic regulation through the reconstruction of integrated metabolic and regulatory models

    Doctoral thesis in Biomedical Engineering. Genome-Scale Metabolic (GEM) models can predict the phenotypic behavior of organisms. However, these models can lead to incorrect predictions, as certain metabolic processes are controlled by regulatory mechanisms. Accordingly, many methodologies have been developed to extend the reconstruction and analysis of GEM models via the integration of Transcriptional Regulatory Networks (TRNs). Nevertheless, reconstructing integrated genome-scale regulatory and metabolic models for diverse prokaryotes is still an open challenge. In this work, we propose several tools to assist the reconstruction and analysis of regulatory and metabolic models. We start by describing Biological networks constraint-based In Silico Optimization (BioISO), a novel tool to assist the manual curation of GEM models. BioISO uses a recursive relation-like algorithm and Flux Balance Analysis (FBA) to evaluate and guide debugging of in silico phenotype predictions. Hence, this tool can reduce the number of artifacts in GEM models, decreasing the burdens of model refinement and curation. A state-of-the-art repository of TRNs for prokaryotes, the Prokaryotic Transcriptional Regulatory Network Database (ProTReND), was implemented to support the reconstruction and integration of TRNs into GEM models. The ProTReND repository comprises several tools to extract and process regulatory information available in several resources. More importantly, this repository contains a data integration system to unify the regulatory data into standardized TRNs at the genome scale. In addition, ProTReND contains a web application with full access to the regulatory data.
    Finally, we have developed a new modeling framework to define, simulate and analyze GEnome-scale Regulatory and Metabolic (GERM) models in MEWpy. The GERM model framework can read a GEM model, as well as a TRN, from different file formats. This framework assembles a GERM model using the regulatory interactions and Genes-Proteins-Reactions (GPR) rules encoded in the GEM model and TRN. In addition, this modeling framework supports several methods of phenotype prediction designed for regulatory-metabolic models.
    I would like to thank Fundação para a Ciência e Tecnologia for the Ph.D. studentship I was awarded (SFRH/BD/139198/2018).

    Integration of Horizontally Transferred Genes into Regulatory Interaction Networks Takes Many Million Years

    Adaptation of bacteria to new or changing environments is often associated with the uptake of foreign genes through horizontal gene transfer. However, it has remained unclear how (and how fast) new genes are integrated into their host's cellular networks. Combining the regulatory and protein interaction networks of Escherichia coli with comparative genomics tools, we provide the first systematic analysis of this issue. Genes transferred recently have fewer interaction partners than nontransferred genes in both regulatory and protein interaction networks. Thus, horizontally transferred genes involved in complex regulatory and protein-protein interactions are rarely favored by selection. Only a few protein-protein interactions are gained after the initial integration of genes following the transfer event. In contrast, transferred genes are gradually integrated into the regulatory network of their host over evolutionary time. During adaptation to the host cellular environment, horizontally transferred genes recruit existing transcription factors of the host, as reflected in the fast evolutionary rates of the cis-regulatory regions of transferred genes. Further, genes resulting from increasingly ancient transfer events show increasing numbers of transcriptional regulators as well as improved coregulation with interacting proteins. Fine-tuned integration of horizontally transferred genes into the regulatory network spans more than 8-22 million years and encompasses accelerated evolution of regulatory regions, stabilization of protein-protein interactions, and changes in codon usage.

    Gene Regulatory Network Analysis and Web-based Application Development

    Microarray data is a valuable source for gene regulatory network analysis. Using earthworm microarray data analysis as an example, this dissertation demonstrates that a bioinformatics-guided reverse engineering approach can be applied to time-series data to uncover the underlying molecular mechanism. My network reconstruction results reinforce previous findings that certain neurotransmitter pathways are the target of two chemicals, carbaryl and RDX. This study also concludes that perturbations to these pathways by sublethal concentrations of these two chemicals were temporary and that earthworms were capable of fully recovering. Moreover, differential networks (DNs) analysis indicates that many pathways other than those related to synaptic and neuronal activities were altered during the exposure phase. A novel DNs approach is developed in this dissertation to connect pathway perturbation with toxicity threshold setting from Live Cell Array (LCA) data. Findings from this proof-of-concept study suggest that this DNs approach has great potential to provide a novel and sensitive tool for threshold setting in chemical risk assessment. In addition, a web-based tool, "Web-BLOM", was developed for the reconstruction of gene regulatory networks from time-series gene expression profiles, including microarray and LCA data. This tool consists of several modular components: a database, the gene network reconstruction model, and a user interface. The Bayesian Learning and Optimization Model (BLOM), originally implemented in MATLAB, was adopted by Web-BLOM to provide online reconstruction of large-scale gene regulatory networks. Compared to other network reconstruction models, BLOM can infer larger networks with comparable accuracy, identify hub genes, and is much more computationally efficient.

    GREAT: gene regulation evaluation tool

    Master's thesis. Information Technologies applied to Biological and Medical Sciences. Universidade de Lisboa, Faculdade de Ciências, 2009.
    The understanding of biological systems depends on the study of the mechanisms that regulate gene expression. These mechanisms control when and for how long the information coded in a gene is used, and can act at several steps of the gene expression process. In the present work, the step of interest is transcription, in which the DNA sequence of a gene is transcribed into an RNA sequence that will later be used to synthesise a protein. Transcriptional regulation centres on the action of a class of regulatory proteins called transcription factors. These bind to the DNA in the region near the start of a gene (the promoter region), enhancing or inhibiting the binding of the protein responsible for transcription. Transcription factors are specific for short DNA sequences (called binding motifs) that are present in the promoter regions of the genes they regulate. A gene can be regulated by different transcription factors; a transcription factor can regulate different genes; and two transcription factors can have identical binding motifs. The expression of genes that encode transcription factors is itself regulated, by a range of mechanisms that includes interaction with other transcription factors.
    Knowledge of how genes and proteins interact allows the creation of models that represent how the system in question (a biological process or a cell) behaves. These models can be represented as gene regulatory networks. Although such networks may differ structurally, their elementary components can be described as follows: vertices represent genes (or the proteins they encode) and edges represent individual molecular reactions, such as the protein interactions through which the products of one gene affect those of another. Representing gene regulations as networks promotes, among other things, the discovery of groups of genes that are co-regulated and participate in the same biological process. Because transcription factors can be regulated by other transcription factors, there are two types of regulation: direct and indirect. Direct regulations are gene-transcription factor pairs in which the expression of the gene is regulated by the transcription factor in the pair; indirect regulations are pairs in which the expression of the gene is regulated by a transcription factor whose expression is in turn regulated by the transcription factor in the pair. Two types of experimental methods allow gene regulations to be identified: direct methods, which identify direct regulations, and indirect methods, which identify regulations without being able to distinguish direct from indirect ones. Direct methods assess the physical binding of the transcription factor to the gene, whereas indirect methods assess changes in gene expression patterns caused by the influence of transcription factors (that is, if the action of a given transcription factor ceases, which genes undergo changes in transcription, and with what intensity).
    Of the four methods described next, the first two are direct and the last two indirect. ChIP (chromatin immunoprecipitation) is used to investigate in vivo interactions between DNA and proteins [1,2]. ChIP-chip is an adaptation of ChIP carried out at the genome scale: a microarray representing the complete genome of an organism is exposed to a given transcription factor, allowing the identification of all the genes it regulates [3]. Microarrays allow large-scale evaluation of changes in gene expression, considering the complete genome of an organism or only a single metabolic pathway [4]. Proteomics comprises several methods that identify the genes regulated by a given transcription factor by studying the expression level of the proteins encoded by those genes [5].
    The knowledge about gene regulations is mainly available in the literature. Although there are currently many public biological databases, the great majority contain data on biological entities but not explicitly on gene regulations. In order to provide the scientific community with data on Saccharomyces cerevisiae gene regulations, a Portuguese public database named Yeastract, maintained by manual curation of scientific literature, was created. Due to the increasing number of papers being published, the development of automatic tools that can assist manual curation is of great importance. In the specific case of Yeastract, a tool was needed to help identify papers describing gene regulations in S. cerevisiae. This tool has two components: one that identifies transcription factors in the papers' abstracts and verifies whether the abstracts describe gene regulations, and another that evaluates whether the hypothetical regulations a paper contains correspond to valid regulations from a biological point of view. This second component was named GREAT (Gene Regulation EvAluation Tool) and is the goal of my work.
    The tool I developed receives as input a list of papers in whose abstracts transcription factors were identified and, to validate the regulations, uses data obtained exclusively from public biological databases. That data is used to evaluate three aspects: the participation of a gene and a transcription factor in the same biological process; the existence of the transcription factor binding motif in the gene promoter region; and the experimental method with which the regulation was identified. The result for each of these aspects is used by a machine learning method, either regression trees or model trees, to calculate a confidence score for each putative gene regulation. Papers containing regulations with high scores will be manually curated to extract the gene regulations. A first prototype of GREAT was successfully implemented. However, from a biological point of view, the results obtained were unsatisfactory, which prompted a detailed analysis of the data used. This analysis revealed important issues, essentially related to the insufficiency of the available data, and identified measures that could be implemented in the current prototype to resolve the problems found.
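    The distinction this thesis draws between direct and indirect regulations can be sketched in a few lines. This is an illustrative assumption, not code from GREAT or Yeastract: the function name, the data layout (a set of known direct regulator-target pairs), and all gene names are hypothetical. A pair counts as indirect when the transcription factor directly regulates an intermediate regulator that in turn directly regulates the gene.

```python
def classify_regulation(tf, gene, direct_pairs):
    """Classify a (tf, gene) pair given a set of known direct (regulator, target) pairs.

    Returns "direct", "indirect" (one intermediate regulator), or "none".
    """
    if (tf, gene) in direct_pairs:
        return "direct"
    # Look for an intermediate regulator: tf -> intermediate -> gene.
    for regulator, intermediate in direct_pairs:
        if regulator == tf and (intermediate, gene) in direct_pairs:
            return "indirect"
    return "none"


# Hypothetical toy data: TF1 regulates TF2 and GENE_B; TF2 regulates GENE_A.
known_direct = {("TF1", "TF2"), ("TF2", "GENE_A"), ("TF1", "GENE_B")}

labels = {
    pair: classify_regulation(pair[0], pair[1], known_direct)
    for pair in [("TF1", "GENE_B"), ("TF1", "GENE_A"), ("TF2", "GENE_B")]
}
```

    An expression-based (indirect) experiment would report both the ("TF1", "GENE_B") and the ("TF1", "GENE_A") pair without telling them apart, which is why the experimental method is one of the aspects the tool evaluates.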