
    Challenges for automatically extracting molecular interactions from full-text articles

    Background: The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles.
    Results: We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved.
    We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set.
    Conclusion: We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.
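
As a rough illustration of how such a gold-standard set of annotated passages can drive an oracle-style retrieval evaluation, the sketch below scores hypothetical retrieved sentence identifiers against hypothetical gold identifiers; the identifiers and the simple precision/recall scoring are assumptions for the example, not the corpus's actual format.

```python
# Minimal sketch of scoring a sentence retrieval system against a
# gold-standard set of annotated passages (all identifiers are hypothetical).

def evaluate_retrieval(retrieved, gold):
    """Compute precision, recall and F1 of retrieved sentence IDs
    against the gold-standard sentence IDs for one interaction fact."""
    retrieved, gold = set(retrieved), set(gold)
    tp = len(retrieved & gold)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: sentences s12 and s47 describe the interaction,
# while the system returned s12 and s90.
print(evaluate_retrieval(["s12", "s90"], ["s12", "s47"]))
# -> (0.5, 0.5, 0.5)
```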

    Accelerating COVID-19 research with graph mining and transformer-based learning

    In 2020, the White House released the "Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset," in which artificial intelligence experts are asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present two automated general-purpose hypothesis generation systems, AGATHA-C and AGATHA-GP, for COVID-19 research. The systems are based on graph mining and the transformer model. The systems are massively validated using retrospective information rediscovery and proactive analysis involving human-in-the-loop expert analysis. Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) in fast computational time and are released to the broad scientific community to accelerate biomedical research. In addition, by performing a domain-expert-curated study, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the hormone oxytocin.
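
The retrospective rediscovery style of validation mentioned above can be pictured with a small, hedged sketch: a model trained on pre-cutoff literature scores candidate concept pairs, and ROC AUC measures how well those scores separate pairs that later appear in publications from negative pairs. The pairs, scores and labels below are invented; only `sklearn.metrics.roc_auc_score` is a real dependency.

```python
# Toy illustration of retrospective-rediscovery validation: score candidate
# concept pairs with a hypothesis generation model trained on pre-cutoff
# literature, then check how well the scores separate pairs that appeared in
# later publications (label 1) from negative pairs (label 0).
from sklearn.metrics import roc_auc_score

candidate_pairs = [("covid-19", "oxytocin"), ("covid-19", "zinc"),
                   ("covid-19", "melatonin"), ("covid-19", "caffeine")]
model_scores = [0.91, 0.34, 0.78, 0.12]   # hypothetical model outputs
appeared_later = [1, 0, 1, 0]             # hypothetical ground truth

print("ROC AUC:", roc_auc_score(appeared_later, model_scores))
```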

    Event extraction of bacteria biotopes: a knowledge-intensive NLP-based approach

    Background: Bacteria biotopes cover a wide range of diverse habitats including animal and plant hosts, and natural, medical and industrial environments. The high volume of publications in the microbiology domain provides a rich source of up-to-date information on bacteria biotopes. This information, as found in scientific articles, is expressed in natural language and is rarely available in a structured format, such as a database. This information is of great importance for fundamental research and microbiology applications (e.g., medicine, agronomy, food, bioenergy). The automatic extraction of this information from texts will greatly benefit the field.

    Towards new information resources for public health: From WordNet to MedicalWordNet

    In the last two decades, WORDNET has evolved as the most comprehensive computational lexicon of general English. In this article, we discuss its potential for supporting the creation of an entirely new kind of information resource for public health, viz. MEDICAL WORDNET. This resource is not to be conceived merely as a lexical extension of the original WORDNET to medical terminology; indeed, there is already a considerable degree of overlap between WORDNET and the vocabulary of medicine. Instead, we propose a new type of repository, consisting of three large collections of (1) medically relevant word forms, structured along the lines of the existing Princeton WORDNET; (2) medically validated propositions, referred to here as medical facts, which will constitute what we shall call MEDICAL FACTNET; and (3) propositions reflecting laypersons’ medical beliefs, which will constitute what we shall call the MEDICAL BELIEFNET. We introduce a methodology for setting up the MEDICAL WORDNET. We then turn to the discussion of research challenges that have to be met in order to build this new type of information resource.
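
Purely as an illustration of the proposed three-part structure (word forms, validated facts, lay beliefs), here is a minimal data-structure sketch; all class and field names are invented for the example, and the article does not prescribe any implementation.

```python
# Illustrative-only sketch of the proposed three-part resource: a lexicon of
# medically relevant word forms, a store of validated medical facts, and a
# store of laypersons' medical beliefs.
from dataclasses import dataclass, field

@dataclass
class Synset:
    word_forms: list[str]   # e.g. ["myocardial infarction", "heart attack"]
    gloss: str

@dataclass
class Proposition:
    text: str
    validated: bool         # True for validated facts, False for lay beliefs

@dataclass
class MedicalWordNetSketch:
    lexicon: list[Synset] = field(default_factory=list)        # WordNet-style word forms
    factnet: list[Proposition] = field(default_factory=list)   # medically validated facts
    beliefnet: list[Proposition] = field(default_factory=list) # laypersons' beliefs
```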

    Text mining for metabolic reaction extraction from scientific literature

    Science relies on data in all its different forms. In molecular biology and bioinformatics in particular, large-scale data generation has taken centre stage in the form of high-throughput experiments. In line with this exponential increase of experimental data has been the near-exponential growth of scientific publications. Yet where classical data mining techniques are still capable of coping with this deluge of structured data (Chapter 2), access to information found in scientific literature is still limited to search engines that allow searches at the level of keywords, titles and abstracts. However, large amounts of knowledge about biological entities and their relations are held within the body of articles. When extracted, this data can be used as evidence for existing knowledge or for hypothesis generation, making scientific literature a valuable scientific resource. Unlocking the information inside the articles requires a dedicated set of techniques and approaches tailored to the unstructured nature of free text. Analogous to the field of data mining for the analysis of structured data, the field of text mining has emerged for unstructured text, and a number of applications have been developed in that field.
This thesis is about text mining in the field of metabolomics. The work focusses on strategies for accessing large collections of scientific text and on the text mining steps required to extract metabolic reactions and their constituents, enzymes and metabolites, from scientific text. Metabolic reactions are important for our understanding of metabolic processes within cells, and that information provides an important link between genotype and phenotype. Furthermore, information about metabolic reactions stored in databases is far from complete, making it an excellent target for our text mining application.
In order to access the scientific publications for further analysis, they can be used as flat text or loaded into database systems. In Chapter 2 we assessed and discussed the capabilities and performance of XML-type database systems to store and access very large collections of XML-type documents in the form of the Medline corpus, a collection of more than 20 million scientific abstracts. XML data formats are common in the field of bioinformatics and are also at the core of most web services. With the increasing amount of data stored in XML comes the need for storing and accessing the data. The database systems were evaluated on a number of aspects broadly ranging from technical requirements to ease-of-use and performance. The performance of the different XML-type database systems was measured on Medline abstract collections of increasing size and with a number of different queries. One of the queries assessed the capabilities of each database system to search the full text of each abstract, which would allow access to the information within the text without further text analysis. The results show that all database systems cope well with the small and medium datasets, but that the full dataset remains a challenge. Also, the query possibilities varied greatly across the studied databases. This led us to conclude that the performance and possibilities of the different database types vary greatly, also depending on the type of research question. There is no single system that outperforms the others; instead, different circumstances can lead to a different optimal solution. Some of these scenarios are presented in the chapter.
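
To make the full-text query scenario concrete, here is a minimal sketch assuming a toy Medline-style XML record and Python's standard library parser; a real XML database would index millions of records rather than scanning parsed strings, and the element names are simplified.

```python
# Sketch of the kind of full-text query run against Medline-style XML records;
# the record below is a toy example with simplified element names.
import xml.etree.ElementTree as ET

record = """
<MedlineCitation>
  <PMID>12345678</PMID>
  <Article>
    <ArticleTitle>Glycolysis revisited</ArticleTitle>
    <Abstract><AbstractText>Hexokinase phosphorylates glucose ...</AbstractText></Abstract>
  </Article>
</MedlineCitation>
"""

root = ET.fromstring(record)
abstract = root.findtext(".//AbstractText") or ""
if "hexokinase" in abstract.lower():        # naive full-text containment query
    print("hit:", root.findtext(".//PMID"))
```
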
Among the conclusions of Chapter 2 is that conventional data mining techniques do not work for the natural language part of a publication beyond simple retrieval queries based on pattern matching. The natural language used in written text is too unstructured for that purpose and requires dedicated text mining approaches, the main research topic of this thesis. Two major tasks of text mining are named entity recognition, the identification of relevant entities in the text, and relation extraction, the identification of relations between those named entities. For both text mining tasks many different techniques and approaches have been developed. For the named entity recognition of enzymes and metabolites we used a dictionary-based approach (Chapter 3), and for metabolic reaction extraction a full grammar approach (Chapter 4).
In Chapter 3 we describe the creation of two thesauri, one for enzymes and one for metabolites, with the specific goal of allowing named entity identification, the mapping of identified synonyms to a common identifier, for metabolic reaction extraction. In the case of the enzyme thesaurus these identifiers are Enzyme Nomenclature numbers (EC numbers); in the case of the metabolite thesaurus, KEGG metabolite identifiers. These thesauri are applied to the identification of enzymes and metabolites in the text mining approach of Chapter 4. Both were created from existing data sources by a series of automated steps followed by manual curation. Compared to a previously published chemical thesaurus, created entirely with automated steps, our much smaller metabolite thesaurus performed at the same level for F-measure with a slightly higher precision. The enzyme thesaurus produced results equal to our metabolite thesaurus. The compactness of our thesauri permits the manual curation step that is important in guaranteeing the accuracy of the thesaurus contents, whereas creation from existing resources by automated means limits the effort required for creation. We concluded that our thesauri are compact and of high quality, and that this compactness does not greatly impact recall.
In Chapter 4 we studied the applicability and performance of a full parsing approach, using the two thesauri described in Chapter 3, for the extraction of metabolic reactions from scientific full-text articles. For this we developed a text mining pipeline built around a modified dependency parser from the AGFL grammar lab, using a pattern-based approach to extract metabolic reactions from the parsing output. Results of a comparison to a modified rule-based approach by Czarnecki et al., using three previously described metabolic pathways from the EcoCyc database, show a slightly lower recall compared to the rule-based approach, but higher precision. We concluded that despite its current recall our full parsing approach to metabolic reaction extraction has high precision and the potential to be used to (re-)construct metabolic pathways in an automated setting. Future improvements to the grammar and relation extraction rules should allow reactions to be extracted with even higher specificity. To identify potential improvements to the recall, the effect of a number of text pre-processing steps on the performance was tested in a series of experiments. The experiment that had the most effect on performance was the conversion of schematic chemical formulas to syntactically complete sentences, allowing them to be analysed by the parser.
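
As a hedged sketch of what dictionary-based named entity identification with such thesauri can look like, the example below maps synonyms to EC numbers and KEGG identifiers with a longest-match-first lookup; the thesaurus entries are tiny hand-picked examples, not the thesauri built in Chapter 3.

```python
# Minimal sketch of dictionary-based named entity identification: synonyms from
# an enzyme and a metabolite thesaurus are mapped to EC numbers and KEGG
# identifiers (entries below are illustrative only).
enzyme_thesaurus = {"hexokinase": "EC 2.7.1.1", "glucokinase": "EC 2.7.1.2"}
metabolite_thesaurus = {"glucose": "C00031", "d-glucose": "C00031",
                        "glucose 6-phosphate": "C00092"}

def tag_entities(sentence, thesaurus):
    """Return (synonym, identifier) pairs found in the sentence,
    preferring longer synonyms so 'glucose 6-phosphate' beats 'glucose'."""
    hits, text = [], sentence.lower()
    for synonym in sorted(thesaurus, key=len, reverse=True):
        if synonym in text:
            hits.append((synonym, thesaurus[synonym]))
            text = text.replace(synonym, " ")   # block shorter, overlapping matches
    return hits

sentence = "Hexokinase converts glucose to glucose 6-phosphate."
print(tag_entities(sentence, enzyme_thesaurus))
print(tag_entities(sentence, metabolite_thesaurus))
```
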
In addition to the improvements to the text mining approach described in Chapter 4, I make suggestions in Chapter 5 for potential improvements and extensions to our full parsing approach for metabolic reaction extraction. The core focus here is increasing recall by optimising each of the steps required for the final goal of extracting metabolic reactions from the text. Some of the discussed improvements are to increase the coverage of the thesauri used, possibly with specialist thesauri depending on the analysed literature. Another potential target is the grammar, where there is still room to increase parsing success by taking into account the characteristics of biomedical language. On a different level are suggestions to include some form of anaphora resolution and across-sentence-boundary search to increase the amount of information extracted from the literature.
In the second part of Chapter 5 I make suggestions as to how to maximise the information gained from the text mining results. One of the first steps should be integration with other biomedical databases to allow integration with existing knowledge about metabolic reactions and other biological entities. Another aspect is some form of ranking or weighting of the results, to be able to distinguish between high-quality results useful for automated analyses and lower-quality results still useful for manual approaches. Furthermore, I provide a perspective on the necessity of computational literature analysis in the form of text mining. The main reasoning here is that human annotators cannot keep up with the amount of publications, so some form of automated analysis is unavoidable. Lastly, I discuss the role of text mining in bioinformatics, and with that also the accessibility of both text mining results and the literature resources necessary to create them. An important requirement for the future of text mining is that the barriers around high-throughput access to literature for text mining applications have to be removed. With regard to accessing text mining results, there is a long way to go for many applications, including ours, before they can be used directly by biologists. A major factor is that these applications rarely feature a suitable user interface and easy-to-use setup.
To conclude, I see the main role of a text mining system like ours in gathering evidence for existing knowledge and giving insights into the nuances of the research landscape of a given topic. When using the results of our reaction extraction system for the identification of ‘new’ reactions it is important to go back to the actual evidence presented for extra validation and to cross-validate the predictions with other resources or experiments. Ideally text mining will be used for the generation of hypotheses, in which the researcher uses text mining findings to get ideas on, in our case, new connections between metabolites and enzymes; subsequently the researcher needs to go back to the original texts for further study. In this role text mining is an essential tool on the workbench of the molecular biologist.
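
One way to picture the suggested ranking or weighting of results is to count how many distinct articles support each extracted reaction, as in the sketch below; the extracted tuples and identifiers are invented for the example and do not come from the thesis.

```python
# Sketch of evidence-based weighting of extraction results: rank candidate
# reactions by how many distinct articles mention them, so well-supported
# reactions can be separated from single-mention ones.
from collections import defaultdict

# (enzyme EC number, substrate KEGG id, product KEGG id, source article)
extractions = [
    ("EC 2.7.1.1", "C00031", "C00092", "PMID:111"),
    ("EC 2.7.1.1", "C00031", "C00092", "PMID:222"),
    ("EC 5.3.1.9", "C00092", "C00085", "PMID:333"),
]

evidence = defaultdict(set)
for enzyme, substrate, product, article in extractions:
    evidence[(enzyme, substrate, product)].add(article)

for reaction, articles in sorted(evidence.items(), key=lambda kv: -len(kv[1])):
    print(len(articles), "articles:", reaction)
```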