305 research outputs found

    A critical review of PASBio's argument structures for biomedical verbs

    Get PDF
    BACKGROUND: Propositional representations of biomedical knowledge are a critical component of most aspects of semantic mining in biomedicine. However, the proper set of propositions has yet to be determined. Recently, the PASBio project proposed a set of propositions and argument structures for biomedical verbs. This initial set of representations presents an opportunity for evaluating the suitability of predicate-argument structures as a scheme for representing verbal semantics in the biomedical domain. Here, we quantitatively evaluate several dimensions of the initial PASBio propositional structure repository. RESULTS: We propose a number of metrics and heuristics related to arity, role labelling, argument realization, and corpus coverage for evaluating large-scale predicate-argument structure proposals. We evaluate the metrics and heuristics by applying them to PASBio 1.0. CONCLUSION: PASBio demonstrates the suitability of predicate-argument structures for representing aspects of the semantics of biomedical verbs. Metrics related to theta-criterion violations and to the distribution of arguments are able to detect flaws in semantic representations, given a set of predicate-argument structures and a relatively small corpus annotated with them

    Bootstrapping a Verb Lexicon for Biomedical Information Extraction

    Get PDF
    The accurate extraction of information from texts requires both syntactic and semantic resources. We are developing a verb dictionary for use in the processing of biomedical texts that includes both syntactic subcategorisation frames and semantic event frames, and links them together. In this paper, we describe the acquisition of syntactic subcategorisation frames from a large corpus of abstracts of the subject of E. Coli, together with the extraction of linguistic event frames from a subset of this corpus, in which the biological process of E. coli gene regulation has been linguistically annotated by a group of biologists. Finally, we report on work carried out to link the syntactic and semantic information together, by mapping syntactic arguments of subcategorisation frames to semantic arguments of the event frames

    Medical WordNet: A new methodology for the construction and validation of information resources for consumer health

    Get PDF
    A consumer health information system must be able to comprehend both expert and non-expert medical vocabulary and to map between the two. We describe an ongoing project to create a new lexical database called Medical WordNet (MWN), consisting of medically relevant terms used by and intelligible to non-expert subjects and supplemented by a corpus of natural-language sentences that is designed to provide medically validated contexts for MWN terms. The corpus derives primarily from online health information sources targeted to consumers, and involves two sub-corpora, called Medical FactNet (MFN) and Medical BeliefNet (MBN), respectively. The former consists of statements accredited as true on the basis of a rigorous process of validation, the latter of statements which non-experts believe to be true. We summarize the MWN / MFN / MBN project, and describe some of its applications

    Interchanging lexical resources on the Semantic Web

    Get PDF
    Lexica and terminology databases play a vital role in many NLP applications, but currently most such resources are published in application-specific formats, or with custom access interfaces, leading to the problem that much of this data is in ‘‘data silos’’ and hence difficult to access. The Semantic Web and in particular the Linked Data initiative provide effective solutions to this problem, as well as possibilities for data reuse by inter-lexicon linking, and incorporation of data categories by dereferencable URIs. The Semantic Web focuses on the use of ontologies to describe semantics on the Web, but currently there is no standard for providing complex lexical information for such ontologies and for describing the relationship between the lexicon and the ontology. We present our model, lemon, which aims to address these gap

    Semantic role labeling for protein transport predicates

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Automatic semantic role labeling (SRL) is a natural language processing (NLP) technique that maps sentences to semantic representations. This technique has been widely studied in the recent years, but mostly with data in newswire domains. Here, we report on a SRL model for identifying the semantic roles of biomedical predicates describing protein transport in GeneRIFs – manually curated sentences focusing on gene functions. To avoid the computational cost of syntactic parsing, and because the boundaries of our protein transport roles often did not match up with syntactic phrase boundaries, we approached this problem with a word-chunking paradigm and trained support vector machine classifiers to classify words as being at the beginning, inside or outside of a protein transport role.</p> <p>Results</p> <p>We collected a set of 837 GeneRIFs describing movements of proteins between cellular components, whose predicates were annotated for the semantic roles AGENT, PATIENT, ORIGIN and DESTINATION. We trained these models with the features of previous word-chunking models, features adapted from phrase-chunking models, and features derived from an analysis of our data. Our models were able to label protein transport semantic roles with 87.6% precision and 79.0% recall when using manually annotated protein boundaries, and 87.0% precision and 74.5% recall when using automatically identified ones.</p> <p>Conclusion</p> <p>We successfully adapted the word-chunking classification paradigm to semantic role labeling, applying it to a new domain with predicates completely absent from any previous studies. By combining the traditional word and phrasal role labeling features with biomedical features like protein boundaries and MEDPOST part of speech tags, we were able to address the challenges posed by the new domain data and subsequently build robust models that achieved F-measures as high as 83.1. This system for extracting protein transport information from GeneRIFs performs well even with proteins identified automatically, and is therefore more robust than the rule-based methods previously used to extract protein transport roles.</p

    PASBio: predicate-argument structures for event extraction in molecular biology

    Get PDF
    Background: The exploitation of information extraction (IE), a technology aiming to provide instances of structured representations from free-form text, has been rapidly growing within the molecular biology (MB) research community to keep track of the latest results reported in literature. IE systems have traditionally used shallow syntactic patterns for matching facts in sentences but such approaches appear inadequate to achieve high accuracy in MB event extraction due to complex sentence structure. A consensus in the IE community is emerging on the necessity for exploiting deeper knowledge structures such as through the relations between a verb and its arguments shown by predicate-argument structure (PAS). PAS is of interest as structures typically correspond to events of interest and their participating entities. For this to be realized within IE a key knowledge component is the definition of PAS frames. PAS frames for non-technical domains such as newswire are already being constructed in several projects such as PropBank, VerbNet, and FrameNet. Knowledge from PAS should enable more accurate applications in several areas where sentence understanding is required like machine translation and text summarization. In this article, we explore the need to adapt PAS for the MB domain and specify PAS frames to support IE, as well as outlining the major issues that require consideration in their construction. Results: We introduce PASBio by extending a model based on PropBank to the MB domain. The hypothesis we explore is that PAS holds the key for understanding relationships describing the roles of genes and gene products in mediating their biological functions. We chose predicates describing gene expression, molecular interactions and signal transduction events with the aim of covering a number of research areas in MB. Analysis was performed on sentences containing a set of verbal predicates from MEDLINE and full text journals. Results confirm the necessity to analyze PAS specifically for MB domain. Conclusions: At present PASBio contains the analyzed PAS of over 30 verbs, publicly available on the Internet for use in advanced applications. In the future we aim to expand the knowledge base to cover more verbs and the nominal form of each predicate

    Semantic frames and semantic networks in the Health Science Corpus

    Full text link
    [eng] The aim of this paper is to apply frame semantics principles to the analysis of a specialized corpus, the Health Science Corpus, implemented in the lexical database SciE-Lex. Taking FrameNet as the basis for this research, I will assign frame semantic features to Scie-Lex data in order to highlight the shared semantic and syntactic background of the related words in the biomedical register, give motivation to their patterns of collocates and establish frame-based semantic networks of related lexical units.[spa] El objetivo de este artículo es aplicar los principios de la semántica de marcos al análisis de un corpus especializado, el Health Science Corpus, implementado en la base de datos léxica SciE-Lex. Tomando FrameNet como base para esta investigación, se aplica la semántica de marcos a los datos de Scie-Lex para destacar los aspectos sintácticos y semánticos communes de los términos del registro biomédico, motivar sus patrones combinatorios y establecer redes semánticas basadas en marcos

    Semantic frames and semantic networks in the Health Science Corpus

    Get PDF
    The aim of this paper is to apply frame semantics principles to the analysis of a specialized corpus, the Health Science Corpus, implemented in the lexical data b ase SciE-Lex. Taking FrameNet as the basis for this research, I will assign frame semantic features to Scie-Lex data in order to highlight the shared semantic and syntactic background of the related words in the biomedical register, give motivation to their patterns of collocates and establish frame-based semantic networks of related lexical units.El objetivo de este artículo es aplicar los principios de la semántica de marcos al análisis de un corpus especializado, el Health Science Corpus, implementado en la base de datos léxica SciE-Lex. Tomando FrameNet como base para esta investigación, se aplica la semántica de marcos a los datos de Scie-Lex para destacar los aspectos sintácticos y semánticos communes de los términos del registro biomédico, motivar sus patrones combinatorios y establecer redes semánticas basadas en marcos

    Nominalization and Alternations in Biomedical Language

    Get PDF
    Background: This paper presents data on alternations in the argument structure of common domain-specific verbs and their associated verbal nominalizations in the PennBioIE corpus. Alternation is the term in theoretical linguistics for variations in the surface syntactic form of verbs, e.g. the different forms of stimulate in FSH stimulates follicular development and follicular development is stimulated by FSH. The data is used to assess the implications of alternations for biomedical text mining systems and to test the fit of the sublanguage model to biomedical texts. Methodology/Principal Findings: We examined 1,872 tokens of the ten most common domain-specific verbs or their zerorelated nouns in the PennBioIE corpus and labelled them for the presence or absence of three alternations. We then annotated the arguments of 746 tokens of the nominalizations related to these verbs and counted alternations related to the presence or absence of arguments and to the syntactic position of non-absent arguments. We found that alternations are quite common both for verbs and for nominalizations. We also found a previously undescribed alternation involving an adjectival present participle. Conclusions/Significance: We found that even in this semantically restricted domain, alternations are quite common, and alternations involving nominalizations are exceptionally diverse. Nonetheless, the sublanguage model applies to biomedica
    corecore