
    Automated Extraction of Protein Mutation Impacts from the Biomedical Literature

    Mutations, as sources of evolution, have long been a focus of attention in the biomedical literature. Access to mutation information and its impacts on protein properties facilitates research in various domains, such as enzymology and pharmacology. However, manually reading through the rich and fast-growing repository of biomedical literature is expensive and time-consuming. A number of manually curated databases, such as BRENDA (http://www.brenda-enzymes.org), try to index and provide this information, yet the provided data appears to be incomplete. Thus, there is a growing need for automated approaches to extract this information. In this work, we present a system to automatically extract and summarize impact information for protein mutations. Our system's extraction module is split into subtasks: organism analysis, mutation detection, protein property extraction, and impact analysis. Organisms, as the sources of proteins, must be extracted to help disambiguate genes and proteins; our system therefore extracts organism mentions and grounds them to NCBI. We detect mutation series to correctly ground our detected impacts. Our system also extracts the affected protein properties as well as the magnitude of the effects. The output of our system populates an OWL-DL ontology, which can then be queried to provide structured information. The performance of the system was evaluated on both external and internal corpora and databases, and the results show the reliability of the approaches. Our organism extraction system achieves a precision of 95%, a recall of 94%, and a grounding accuracy of 97.5% on the OT corpus. On the manually annotated Linnaeus-100 corpus, it achieves a precision of 99%, a recall of 97%, and a grounding accuracy of 97.4%. In the impact detection task, our system achieves a precision of 70.4%-71.8% and a recall of 71.2%-71.3% on manually annotated documents. Our system grounds the detected impacts with an accuracy of 70.1%-71.7% on the manually annotated documents, and with a precision of 57%-57.5% and a recall of 82.5%-84.2% against the BRENDA data.
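    The mutation detection subtask is easy to illustrate. Below is a minimal, hypothetical Python sketch (not the authors' code) of how point-mutation mentions in the common wNm format, e.g. A123V or Ala123Val, might be detected with regular expressions; real systems use much richer pattern sets and additionally ground each mention to a specific protein.

        import re

        # One-letter and three-letter amino acid codes.
        AA1 = "ACDEFGHIKLMNPQRSTVWY"
        AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
               "Phe|Pro|Ser|Thr|Trp|Tyr|Val")

        # Matches "A123V" (wild type, position, mutant) or "Ala123Val".
        POINT_MUTATION = re.compile(
            rf"\b(?:[{AA1}]\d+[{AA1}]|(?:{AA3})\d+(?:{AA3}))\b"
        )

        def find_mutations(text: str) -> list[str]:
            """Return raw point-mutation mentions found in a sentence."""
            return [m.group(0) for m in POINT_MUTATION.finditer(text)]

        print(find_mutations("The A123V and Gly12Asp mutants lost all activity."))
        # -> ['A123V', 'Gly12Asp']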

    Text Mining Improves Prediction of Protein Functional Sites

    We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature and predicts that the residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant over those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions.
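    The "six times more likely" figure is a simple enrichment ratio: the rate of functional-site membership among mentioned residues divided by the rate among all other residues. A Python sketch follows, with invented counts (the paper's actual counts are not reproduced here).

        def enrichment(mentioned_in_site: int, mentioned_total: int,
                       other_in_site: int, other_total: int) -> float:
            """Relative rate of functional-site membership for residues
            mentioned in the literature vs. residues that are not."""
            return (mentioned_in_site / mentioned_total) / (other_in_site / other_total)

        # e.g. 300 of 1,000 mentioned residues fall in an annotated site,
        # versus 50,000 of 1,000,000 residues that are never mentioned:
        print(enrichment(300, 1_000, 50_000, 1_000_000))  # -> 6.0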

    BELB: a Biomedical Entity Linking Benchmark

    Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base. It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups, making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad-coverage knowledge base UMLS, leaving their performance on more specialized ones, e.g. genes or variants, understudied. We therefore developed BELB, a Biomedical Entity Linking Benchmark, providing access in a unified format to 11 corpora linked to 7 knowledge bases and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces the preprocessing overhead of testing BEL systems on multiple corpora, offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture, showing that neural approaches fail to perform consistently across entity types and highlighting the need for further studies towards entity-agnostic models.
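    The usual headline metric for such a benchmark is accuracy at rank 1: the fraction of mentions whose top-ranked predicted identifier matches the gold identifier. A minimal Python sketch follows, with invented mention and identifier data; BELB's actual data formats and APIs may differ.

        def accuracy_at_1(gold: dict[str, str],
                          predicted: dict[str, list[str]]) -> float:
            """gold: mention id -> gold KB id; predicted: mention id -> ranked KB ids."""
            hits = sum(1 for m, kb_id in gold.items()
                       if predicted.get(m) and predicted[m][0] == kb_id)
            return hits / len(gold)

        gold = {"m1": "NCBIGene:7157", "m2": "MESH:D009369"}
        predicted = {"m1": ["NCBIGene:7157", "NCBIGene:7158"],
                     "m2": ["MESH:D001943"]}
        print(accuracy_at_1(gold, predicted))  # -> 0.5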

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing system performance, in particular the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field. A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
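    As a flavor of the simplest baseline for the chemical entity recognition step, here is a hypothetical Python sketch of case-insensitive dictionary lookup that maps matched names to structures as SMILES strings; the CHEMDNER systems discussed in the Review are far more sophisticated, typically combining statistical taggers with large curated lexica.

        # Tiny name-to-SMILES dictionary for illustration.
        CHEM_DICT = {
            "acetylsalicylic acid": "CC(=O)Oc1ccccc1C(=O)O",
            "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
            "ethanol": "CCO",
        }

        def tag_chemicals(text: str) -> list[tuple[str, str]]:
            """Return (mention, SMILES) pairs for every dictionary name in the
            text, checking longer names first."""
            lowered = text.lower()
            hits = []
            for name in sorted(CHEM_DICT, key=len, reverse=True):
                start = lowered.find(name)
                if start != -1:
                    hits.append((text[start:start + len(name)], CHEM_DICT[name]))
            return hits

        print(tag_chemicals("Aspirin, i.e. acetylsalicylic acid, dissolved in ethanol."))
        # -> [('acetylsalicylic acid', ...), ('Aspirin', ...), ('ethanol', 'CCO')]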

    Getting More out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics.

    This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type, with yearly download rates in the tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine: first, genome-wide association studies, which have contributed to the discovery of a head and neck cancer mutation association; second, medical records analysis, which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort; and third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers, and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

    Community Classification of the Protein Universe

    Protein family databases are an important resource for biologists seeking to characterise the function of proteins, the structure of their domains, and their localisation within the cell. Operating a protein family database requires the identification of families and the curation of literature related to each family. This labour is currently performed by skilled professional curators, whose abilities are a scarce resource. In this thesis, I have developed methods to enable some of this labour to be performed by the community of protein sequence similarity search users. In the first chapter, I review the history of protein sequence and protein family databases, and how the abstract concept of a protein family is expressed as a computational model. I review in greater detail the protein family database Pfam, and the software package hmmer, which uses hidden Markov models to search protein sequence databases. In the second chapter, I explore how the quality of computational models for a protein family can be measured, and how these measurements might be used to assess the quality of community-sourced protein family models. I then investigate how a protein sequence similarity search can be rapidly analysed for overlap with existing protein families in Pfam, using locality sensitive hashing. In the third chapter, I discuss the use of literature search in protein family database curation, and the existing literature resources used by protein family database curators. I then develop a system for performing literature search based on protein families, exploiting the manually annotated links between literature and proteins found in the Swiss-Prot subset of the UniProt protein database. In the fourth chapter, I develop a web application for analysing the results of protein sequence similarity searches, using the methods discussed in the second chapter, and for performing literature search based on the results of protein sequence similarity search, using the methods discussed in the third chapter. In the fifth chapter, I develop a web application which applies the methods developed in the third chapter to the task of curation of the protein classification resource, InterPro. My fellowship was funded by the European Molecular Biology Laboratory's International PhD Programme.
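    The locality sensitive hashing step can be sketched briefly. The following hypothetical Python example (not the thesis code) uses MinHash signatures to estimate the Jaccard overlap between the set of sequences returned by a similarity search and the membership of a Pfam family, without comparing the full sets.

        import hashlib

        def minhash(ids: set[str], num_hashes: int = 64) -> list[int]:
            """One minimum per salted hash approximates a random permutation."""
            return [min(int.from_bytes(
                            hashlib.md5(f"{salt}:{i}".encode()).digest()[:8], "big")
                        for i in ids)
                    for salt in range(num_hashes)]

        def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
            """Fraction of agreeing signature positions estimates Jaccard overlap."""
            return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

        search_hits = {f"seq{i}" for i in range(100)}
        family_members = {f"seq{i}" for i in range(50, 150)}
        print(estimated_jaccard(minhash(search_hits), minhash(family_members)))
        # -> roughly 0.33, the true Jaccard overlap of the two sets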

    Knowledge-driven entity recognition and disambiguation in biomedical text

    Entity recognition and disambiguation (ERD) for the biomedical domain are notoriously difficult problems due to the variety of entities and their often long names with many variations. Existing work focuses heavily on the molecular level in two ways: first, it targets scientific literature as the input text genre; second, it targets single, highly specialized entity types such as chemicals, genes, and proteins. However, a wealth of biomedical information is also buried in the vast universe of Web content. In order to fully utilize all the information available, there is a need to tap into Web content as an additional input. Moreover, there is a need to cater for other entity types such as symptoms and risk factors, since Web content focuses on consumer health. The goal of this thesis is to investigate ERD methods that are applicable to all entity types in scientific literature as well as Web content. In addition, we focus on under-explored aspects of the biomedical ERD problem: scalability, long noun phrases, and out-of-knowledge-base (OOKB) entities. This thesis makes four main contributions, all of which leverage knowledge in UMLS (Unified Medical Language System), the largest and most authoritative knowledge base (KB) of the biomedical domain. The first contribution is a fast dictionary lookup method for entity recognition that maximizes throughput while balancing the loss of precision and recall. The second contribution is a semantic type classification method targeting common words in long noun phrases. We develop a custom set of semantic types to capture word usages; besides biomedical usage, these types also cope with non-biomedical usage and the case of generic, non-informative usage. The third contribution is a fast heuristics method for entity disambiguation in MEDLINE abstracts, again maximizing throughput but this time maintaining accuracy. The fourth contribution is a corpus-driven entity disambiguation method that addresses OOKB entities. The method first captures the entities expressed in a corpus as latent representations, comprising in-KB and OOKB entities alike, before performing entity disambiguation.
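    The first contribution, dictionary lookup, admits a compact illustration. Below is a hypothetical Python sketch (not the thesis implementation) of a longest-match scan against a name dictionary keyed by lowercased surface forms, with UMLS-style concept identifiers shown purely for illustration; high-throughput systems would use a compressed trie or an Aho-Corasick automaton instead of a plain dict.

        DICTIONARY = {
            "myocardial infarction": "C0027051",
            "infarction": "C0021308",
            "aspirin": "C0004057",
        }
        MAX_WORDS = max(len(name.split()) for name in DICTIONARY)

        def lookup(text: str) -> list[tuple[str, str]]:
            """Greedy longest-match dictionary scan over whitespace tokens."""
            tokens = text.lower().split()
            out, i = [], 0
            while i < len(tokens):
                for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
                    span = " ".join(tokens[i:i + n])
                    if span in DICTIONARY:
                        out.append((span, DICTIONARY[span]))
                        i += n
                        break
                else:
                    i += 1  # no entry starts here; advance one token
            return out

        print(lookup("Aspirin reduces mortality after myocardial infarction"))
        # -> [('aspirin', 'C0004057'), ('myocardial infarction', 'C0027051')]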

    Development of a text mining approach to disease network discovery

    Scientific literature is one of the major sources of knowledge for systems biology, in the form of papers, patents, and other types of written reports. Text mining methods aim at automatically extracting relevant information from the literature. The hypothesis of this thesis was that biological systems could be elucidated by the development of text mining solutions that can automatically extract relevant information from documents. The first objective consisted in developing software components to recognize biomedical entities in text, which is the first step to generate a network about a biological system. To this end, a machine learning solution was developed, which can be trained for specific biological entities using an annotated dataset, obtaining high-quality results. Additionally, a rule-based solution was developed, which can be easily adapted to various types of entities. The second objective consisted in developing an automatic approach to link the recognized entities to a reference knowledge base. A solution based on the PageRank algorithm was developed in order to match the entities to the concepts that most contribute to the overall coherence. The third objective consisted in automatically extracting relations between entities, to generate knowledge graphs about biological systems. Due to the lack of annotated datasets available for this task, distant supervision was employed to train a relation classifier on a corpus of documents and a knowledge base. The applicability of this approach was demonstrated in two case studies: microRNA-gene relations for cystic fibrosis, obtaining a network of 27 relations using the abstracts of 51 recently published papers; and cell-cytokine relations for tolerogenic cell therapies, obtaining a network of 647 relations from 3264 abstracts. Through a manual evaluation, the information contained in these networks was determined to be relevant. Additionally, a solution combining deep learning techniques with ontology information was developed, to take advantage of the domain knowledge provided by ontologies. This thesis contributed several solutions that demonstrate the usefulness of text mining methods to systems biology by extracting domain-specific information from the literature. These solutions make it easier to integrate various areas of research, leading to a better understanding of biological systems.
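    The PageRank-based linking idea can be sketched as follows: build a graph whose nodes are the candidate concepts for all mentions in a document, connect candidates that are related in the knowledge base, run PageRank, and keep the highest-scoring candidate for each mention. The Python example below is hypothetical, with an invented toy graph; the thesis applies the idea against a real reference knowledge base.

        def pagerank(graph: dict[str, list[str]], d: float = 0.85,
                     iters: int = 50) -> dict[str, float]:
            """Plain power-iteration PageRank over an adjacency-list graph."""
            n = len(graph)
            score = {node: 1 / n for node in graph}
            for _ in range(iters):
                new = {node: (1 - d) / n for node in graph}
                for node, out in graph.items():
                    for neighbour in out:
                        new[neighbour] += d * score[node] / len(out)
                score = new
            return score

        # Candidate concepts for the mention "CFTR", plus context concepts.
        graph = {
            "CFTR_gene": ["cystic_fibrosis", "miR-145"],
            "miR-145": ["CFTR_gene", "cystic_fibrosis"],
            "cystic_fibrosis": ["CFTR_gene"],
            "CFTR_protein": [],  # competing candidate with no contextual support
        }
        scores = pagerank(graph)
        print(max(["CFTR_gene", "CFTR_protein"], key=scores.get))  # -> CFTR_gene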

    The Evolution of Language Universals: Optimal Design and Adaptation

    Inquiry into the evolution of syntactic universals is hampered by severe limitations on the available evidence. Theories of selective function nevertheless lead to predictions of local optimality that can be tested scientifically. This thesis refines a diagnostic, originally proposed by Parker and Maynard Smith (1990), for identifying selective functions on this basis, and applies it to the evolution of two syntactic universals: (1) the distinction between open and closed lexical classes, and (2) nested constituent structure. In the case of the former, it is argued that the selective role of the closed class items is primarily to minimise the amount of redundancy in the lexicon. In the case of the latter, the emergence of nested phrase structure is argued to have been a by-product of selection for the ability to perform insertion operations on sequences, a function that plausibly pre-dated the emergence of modern language competence. The evidence for these claims is not just that these properties perform plausibly fitness-related functions, but that they appear to perform them in a way that is improbably optimal. A number of interesting findings follow when examining the selective role of the closed classes. In particular, case, agreement, and the requirement that sentences have subjects are expected consequences of an optimised lexicon, the theory thereby relating these properties to natural selection for the first time. It also motivates the view that language variation is confined to parameters associated with closed class items, in turn explaining why parameter conflicts fail to arise in bilingualism. The simplest representation of sequences that is optimised for efficient insertions can represent both nested constituent structure and long-distance dependencies in a unified way, thus suggesting that movement is intrinsic to the representation of constituency rather than an 'imperfection'. The basic structure of phrases also follows from this representation and helps to explain the interaction between case and theta assignment. These findings bring together a surprising array of phenomena, reinforcing the correctness of this representation as the basis of syntactic structures. The diagnostic overcomes shortcomings in the approach of Pinker and Bloom (1990), who argued that the appearance of 'adaptive complexity' in the design of a trait could be used as evidence of its selective function; but there is no reason to expect the refinements of natural selection to increase complexity in any given case. Optimality considerations are also applied in this thesis to filter theories of the nature of unobserved linguistic representations as well as theories of their functions. In this context, it is argued that, despite Chomsky's (1995) resistance to the idea, it is possible to motivate the guiding principles of the Minimalist Program in terms of evolutionary optimisation, especially if we allow the possibility that properties of language were selected for non-communicative functions and that redundancy is sometimes costly rather than beneficial.