6,972 research outputs found

    Automatic reconstruction of a bacterial regulatory network using Natural Language Processing

    Background: Manual curation of biological databases, an expensive and labor-intensive process, is essential for high quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using Text-Mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12. Results: Performance evaluation is based on the most comprehensive transcriptional regulation database for any organism, the manually-curated RegulonDB, 45% of which we were able to recreate automatically. From our automated analysis we were also able to find some new interactions from papers not already curated, or that were missed in the manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language better suited than SBML for simultaneously representing data of interest for biologists and text miners. Conclusion: Manual curation of the output of automatic processing of text is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages.

    A text-mining system for extracting metabolic reactions from full-text articles

    Background: Increasingly biological text mining research is focusing on the extraction of complex relationships relevant to the construction and curation of biological networks and pathways. However, one important category of pathway—metabolic pathways—has been largely neglected. Here we present a relatively simple method for extracting metabolic reaction information from free text that scores different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence and location of stemmed keywords. This method extends an approach that has proved effective in the context of the extraction of protein–protein interactions. Results: When evaluated on a set of manually-curated metabolic pathways using standard performance criteria, our method performs surprisingly well. Precision and recall rates are comparable to those previously achieved for the well-known protein-protein interaction extraction task. Conclusions: We conclude that automated metabolic pathway construction is more tractable than has often been assumed, and that (as in the case of protein–protein interaction extraction) relatively simple text-mining approaches can prove surprisingly effective. It is hoped that these results will provide an impetus to further research and act as a useful benchmark for judging the performance of more sophisticated methods that are yet to be developed

    Modeling and evolving biochemical networks: insights into communication and computation from the biological domain

    This paper is concerned with the modeling and evolving of Cell Signaling Networks (CSNs) in silico. CSNs are complex biochemical networks responsible for the coordination of cellular activities. We examine the possibility of computationally evolving and simulating Artificial Cell Signaling Networks (ACSNs) by means of Evolutionary Computation techniques. From a practical point of view, realizing and evolving ACSNs may provide novel computational paradigms for a variety of application areas. For example, understanding some inherent properties of CSNs such as crosstalk may be of interest: a potential benefit of engineering crosstalking systems is that it allows the modification of a specific process according to the state of other processes in the system. This is clearly necessary in order to achieve complex control tasks. This work may also contribute to the biological understanding of the origins and evolution of real CSNs. An introduction to CSNs is first provided, in which we describe the potential applications of modeling and evolving these biochemical networks in silico. We then review the different classes of techniques to model CSNs, followed by a presentation of two alternative approaches employed to evolve CSNs within the ESIGNET project. Results obtained with these methods are summarized and discussed.

    Coupling metabolic footprinting and flux balance analysis to predict how single gene knockouts perturb microbial metabolism

    Master's thesis. Biology (Bioinformatics and Computational Biology). Universidade de Lisboa, Faculdade de Ciências, 2012. The model organisms Caenorhabditis elegans and E. coli form one of the simplest gut microbe-host interaction models. Interventions in the microbe that increase host longevity, including inhibition of folate synthesis, have been reported previously. To find novel single gene knockouts with an effect on lifespan, a screen of the Keio collection of E. coli was undertaken, and some of the genes found are directly involved in metabolism. The next step in those specific cases is to understand how these mutations perturb metabolism systematically, so that hypotheses can be generated. For that, I employed dynamic Flux Balance Analysis (dFBA), a constraint-based modeling technique capable of simulating the dynamics of metabolism in a batch culture and making predictions about changes in intracellular flux distribution. Since the specifics of the C. elegans lifespan experiments require us to culture microbes in conditions differing from most of the published literature on E. coli physiology, novel data must be acquired to characterize the system and make dFBA simulations as realistic as possible. To do this, exchange fluxes of metabolites between the microorganism and the culture medium were measured using quantitative ¹H NMR Time-Resolved Metabolic Footprinting (TReF). Furthermore, I also investigate the combination of TReF and dFBA as a tool in microbial metabolism studies. These approaches were tested by comparing wild type E. coli with one of the knockout strains found, ΔmetL, a knockout of the metL gene, which encodes a bifunctional enzyme involved in aspartate and threonine metabolism. I found that the strain exhibits a slower growth rate than the wild type. Model simulation results revealed that reduced homoserine and methionine synthesis, as well as impaired sulfur and folate metabolism, are the main effects of this knockout and the reasons for the growth deficiency. These results indicate that there are common mechanisms of lifespan extension between ΔmetL and inhibition of folate biosynthesis, and that the flux balance analysis/metabolic footprinting approach can help us understand the nature of these mechanisms.

    Semantic systems biology of prokaryotes : heterogeneous data integration to understand bacterial metabolism

    The goal of this thesis is to improve the prediction of genotype to phenotype associations with a focus on metabolic phenotypes of prokaryotes. This goal is achieved through data integration, which in turn required the development of supporting solutions based on semantic web technologies. Chapter 1 provides an introduction to the challenges associated with data integration. Semantic web technologies provide solutions to some of these challenges and the basics of these technologies are explained in the Introduction. Furthermore, the basics of constraint based metabolic modeling and construction of genome scale models (GEM) are also provided. The chapters in the thesis are separated into three related topics: chapters 2, 3 and 4 focus on data integration based on heterogeneous networks and their application to the human pathogen M. tuberculosis; chapters 5, 6, 7, 8 and 9 focus on the semantic web based solutions to genome annotation and applications thereof; and chapter 10 focuses on the final goal to associate genotypes to phenotypes using GEMs. Chapter 2 provides the prototype of a workflow to efficiently analyze information generated by different inference and prediction methods. This method relies on providing the user the means to simultaneously visualize and analyze the coexisting networks generated by different algorithms, heterogeneous data sets, and a suite of analysis tools. As a showcase, we have analyzed the gene co-expression networks of M. tuberculosis generated using over 600 expression experiments. Hereby we gained new knowledge about the regulation of the DNA repair, dormancy, iron uptake and zinc uptake systems. Furthermore, it enabled us to develop a pipeline to integrate ChIP-seq data and a tool to uncover multiple regulatory layers. In chapter 3 the prototype presented in chapter 2 is further developed into the Synchronous Network Data Integration (SyNDI) framework, which is based on Cytoscape and Galaxy. 
The functionality and usability of the framework are highlighted with three biological examples. We analyzed the distinct connectivity of plasma metabolites in networks associated with high or low latent cardiovascular disease risk. We obtained deeper insights from a few similar inflammatory response pathways in Staphylococcus aureus infection common to human and mouse. We identified not yet reported regulatory motifs associated with transcriptional adaptations of M. tuberculosis. In chapter 4 we present a review providing a systems level overview of the molecular and cellular components involved in divalent metal homeostasis and their role in regulating the three main virulence strategies of M. tuberculosis: immune modulation, dormancy and phagosome escape. With the use of the tools presented in chapters 2 and 3 we identified a single regulatory cascade for these three virulence strategies that responds to limited availability of divalent metals in the phagosome. The tools presented in chapters 2 and 3 achieve data integration through the use of multiple similarity, coexistence, coexpression and interaction gene and protein networks. However, the presented tools cannot store additional (genome) annotations. Therefore, we applied semantic web technologies to store and integrate heterogeneous annotation data sets. An increasing number of widely used biological resources are already available in the RDF data model. There are, however, no tools available that provide structural overviews of these resources. Such structural overviews are essential to efficiently query these resources and to assess their structural integrity and design. Therefore, in chapter 5, I present RDF2Graph, a tool that automatically recovers the structure of an RDF resource. The generated overview enables users to create complex queries on these resources and to structurally validate newly created resources. 
Direct functional comparison supports genotype to phenotype predictions. A prerequisite for a direct functional comparison is consistent annotation of the genetic elements with evidence statements. However, the standard structured formats used by the public sequence databases to present genome annotations provide limited support for data mining, hampering comparative analyses at large scale. To enable interoperability of genome annotations for data mining applications, we have developed the Genome Biology Ontology Language (GBOL) and associated infrastructure (GBOL stack), which is presented in chapter 6. GBOL is provenance aware and thus provides a consistent representation of functional genome annotations linked to the provenance. The provenance of a genome annotation describes the contextual details and derivation history of the process that resulted in the annotation. GBOL is modular in design, extensible and linked to existing ontologies. The GBOL stack of supporting tools enforces consistency within and between the GBOL definitions in the ontology. Based on GBOL, we developed the genome annotation pipeline SAPP (Semantic Annotation Platform with Provenance) presented in chapter 7. SAPP automatically predicts, tracks and stores structural and functional annotations and associated dataset- and element-wise provenance in a Linked Data format, thereby enabling information mining and retrieval with Semantic Web technologies. This greatly reduces the administrative burden of handling multiple analysis tools and versions thereof and facilitates multi-level large scale comparative analysis. In turn this can be used to make genotype to phenotype predictions. The development of GBOL and SAPP was done simultaneously. During the development we realized that we had to constantly validate the data exported to RDF to ensure coherence with the ontology. This was an extremely time consuming and error-prone process; therefore we developed the Empusa code generator. 
Empusa is presented in chapter 8. SAPP has been successfully used to annotate 432 sequenced Pseudomonas strains and integrate the resulting annotation in a large scale functional comparison using protein domains. This comparison is presented in chapter 9. Additionally, data from six metabolic models, nearly a thousand transcriptome measurements and four large scale transposon mutagenesis experiments were integrated with the genome annotations. In this way, we linked gene essentiality, persistence and expression variability. This gave us insight into the diversity, versatility and evolutionary history of the Pseudomonas genus, which contains some important pathogens as well as some useful species for bioengineering and bioremediation purposes. Genome annotation can be used to create GEMs, which can be used to better link genotypes to phenotypes. Bio-Growmatch, presented in chapter 10, is a tool that can automatically suggest modifications to improve a GEM based on phenotype data, thereby integrating growth data into the complete process of modelling the metabolism of an organism. Chapter 11 presents a general discussion on how the chapters contributed to the central goal, after which I discuss provenance requirements for data reuse and integration. I further discuss how this can be used to further improve knowledge generation. The acquired knowledge could, in turn, be used to design new experiments. The principles of the dry-lab cycle and how semantic technologies can contribute to establish these cycles are discussed in chapter 11. Finally, a discussion is presented on how to apply these principles to improve the creation and usability of GEMs.

    "Going back to our roots": second generation biocomputing

    Researchers in the field of biocomputing have, for many years, successfully "harvested and exploited" the natural world for inspiration in developing systems that are robust, adaptable and capable of generating novel and even "creative" solutions to human-defined problems. However, in this position paper we argue that the time has now come for a reassessment of how we exploit biology to generate new computational systems. Previous solutions (the "first generation" of biocomputing techniques), whilst reasonably effective, are crude analogues of actual biological systems. We believe that a new, inherently inter-disciplinary approach is needed for the development of the emerging "second generation" of bio-inspired methods. This new modus operandi will require much closer interaction between the engineering and life sciences communities, as well as a bidirectional flow of concepts, applications and expertise. We support our argument by examining, in this new light, three existing areas of biocomputing (genetic programming, artificial immune systems and evolvable hardware), as well as an emerging area (natural genetic engineering) which may provide useful pointers as to the way forward. Comment: Submitted to the International Journal of Unconventional Computing

    Text-mining of PubMed abstracts by natural language processing to create a public knowledge base on molecular mechanisms of bacterial enteropathogens

    Background: The Enteropathogen Resource Integration Center (ERIC; http://www.ericbrc.org) has a goal of providing bioinformatics support for the scientific community researching enteropathogenic bacteria such as Escherichia coli and Salmonella spp. Rapid and accurate identification of experimental conclusions from the scientific literature is critical to support research in this field. Natural Language Processing (NLP), and in particular Information Extraction (IE) technology, can be a significant aid to this process. Description: We have trained a powerful, state-of-the-art IE technology on a corpus of abstracts from the microbial literature in PubMed to automatically identify and categorize biologically relevant entities and predicative relations. These relations include: Genes/Gene Products and their Roles; Gene Mutations and the resulting Phenotypes; and Organisms and their associated Pathogenicity. Evaluations on blind datasets show an F-measure average of greater than 90% for entities (genes, operons, etc.) and over 70% for relations (gene/gene product to role, etc.). This IE capability, combined with text indexing and relational database technologies, constitutes the core of our recently deployed text mining application. Conclusion: Our Text Mining application is available online on the ERIC website http://www.ericbrc.org/portal/eric/articles. The information retrieval interface displays a list of recently published enteropathogen literature abstracts, and also provides a search interface to execute custom queries by keyword, date range, etc. Upon selection, processed abstracts and the entities and relations extracted from them are retrieved from a relational database and marked up to highlight the entities and relations. The abstract also provides links from extracted genes and gene products to the ERIC Annotations database, thus providing access to comprehensive genomic annotations and adding value to both the text-mining and annotations systems.

    Computational Studies and Biosynthesis of Natural Products with Promising Anticancer Properties

    We present an overview of computational approaches for the prediction of metabolic pathways by which plants biosynthesise compounds, with a focus on selected very promising anticancer secondary metabolites from floral sources. We also provide an overview of databases for the retrieval of useful genomic data, discussing the strengths and limitations of selected prediction software and the main computational tools (and methods), which could be employed for the investigation of the uncharted routes towards the biosynthesis of some of the identified anticancer metabolites from plant sources, eventually using specific examples to address some knowledge gaps when using these approaches