9 research outputs found

    PaperMaker: validation of biomedical scientific publications

    Motivation: The automatic analysis of scientific literature can support authors in writing their manuscripts

    MedEvi: Retrieving textual evidence of relations between biomedical concepts from Medline

    Summary: Search engines running on MEDLINE abstracts have been widely used by biologists to find publications that are related to their research. Existing search engines such as PubMed, however, have limitations when applied to the task of seeking textual evidence of relations between given concepts. These limitations arise mainly because the search engines do not deal effectively with multi-term queries, which may imply semantic relations between the terms. To address this problem, we present MedEvi, a novel search engine that imposes positional restrictions on occurrences matching multi-term queries, based on the observation that terms whose semantic relations are explicitly stated in text are not found too far from each other. MedEvi further identifies additional keywords of biological and statistical significance from the local context of matching occurrences in order to help users reformulate their queries for better results.
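
    The positional restriction described in this summary can be illustrated with a small proximity search. This is only a sketch under stated assumptions, not MedEvi's implementation: the function name `proximity_matches`, the `max_span` parameter, the simple word tokenizer, and the restriction to single-word query terms are all hypothetical choices made for the example.

```python
import re


def proximity_matches(text, terms, max_span=10):
    """Return (start, end) token index pairs for windows that contain
    every query term and are at most max_span tokens long.

    Hypothetical helper: single-word terms only, case-insensitive.
    """
    tokens = re.findall(r"\w+", text.lower())
    wanted = {t.lower() for t in terms}
    last_seen = {}  # most recent token index of each query term
    matches = []
    for i, tok in enumerate(tokens):
        if tok not in wanted:
            continue
        last_seen[tok] = i
        if len(last_seen) == len(wanted):
            # Window from the oldest "most recent" occurrence to here.
            start = min(last_seen.values())
            if i - start + 1 <= max_span:
                matches.append((start, i))
    return matches


sentence = "BRCA1 mutations increase the risk of breast cancer in women"
print(proximity_matches(sentence, ["BRCA1", "cancer"], max_span=10))
```

    Lowering `max_span` makes the query stricter: in the sentence above the two terms are eight tokens apart, so a window limit of five tokens yields no match.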

    Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation

    Background: Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD.
    Methods: In our method, the Metathesaurus is first screened to identify ambiguous terms whose possible senses consist of two or more MeSH headings. We then use each ambiguous term and its corresponding MeSH heading to extract MEDLINE citations where the term and only one of the MeSH headings co-occur. The term found in the MEDLINE citation is automatically assigned the UMLS CUI linked to the MeSH heading. Each instance has been assigned a UMLS Concept Unique Identifier (CUI). We compare the characteristics of the MSH WSD data set to the previously existing NLM WSD data set.
    Results: The resulting MSH WSD data set consists of 106 ambiguous abbreviations, 88 ambiguous terms and 9 which are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE. We evaluated the reliability of the MSH WSD data set using existing knowledge-based methods and compared their performance to that of the results previously obtained by these algorithms on the pre-existing data set, NLM WSD. We show that the knowledge-based methods achieve different results but keep their relative performance except for the Journal Descriptor Indexing (JDI) method, whose performance is below the other methods.
    Conclusions: The MSH WSD data set allows the evaluation of WSD algorithms in the biomedical domain. Compared to previously existing data sets, MSH WSD contains a larger number of biomedical terms/abbreviations and covers the largest set of UMLS Semantic Types. Furthermore, the MSH WSD data set has been generated automatically reusing already existing annotations and, therefore, can be regenerated from subsequent UMLS versions.
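
    The instance-selection step described in the Methods can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the function `build_wsd_instances`, the citation dictionaries with `text` and `mesh` keys, the substring test for the term, and the placeholder identifiers `CUI_A` and `CUI_B` are all hypothetical. Only the rule itself (keep a citation when the term co-occurs with exactly one candidate MeSH heading, up to 100 instances per sense) comes from the abstract.

```python
def build_wsd_instances(term, sense_headings, citations, max_per_sense=100):
    """Collect labeled instances for one ambiguous term.

    sense_headings: dict mapping each candidate MeSH heading to its CUI.
    citations: iterable of dicts with 'text' (str) and 'mesh' (set of str).
    """
    instances = {cui: [] for cui in sense_headings.values()}
    for citation in citations:
        # The ambiguous term must appear in the citation text.
        if term.lower() not in citation["text"].lower():
            continue
        # Exactly one candidate heading may be indexed on the citation.
        present = [h for h in sense_headings if h in citation["mesh"]]
        if len(present) != 1:
            continue
        cui = sense_headings[present[0]]
        if len(instances[cui]) < max_per_sense:
            instances[cui].append(citation["text"])
    return instances


# Placeholder CUIs; real identifiers would come from the Metathesaurus.
senses = {"Common Cold": "CUI_A", "Cold Temperature": "CUI_B"}
citations = [
    {"text": "Zinc shortens the duration of a cold.", "mesh": {"Common Cold"}},
    {"text": "Cold exposure alters metabolism.", "mesh": {"Cold Temperature"}},
    {"text": "Cold weather and cold symptoms.", "mesh": {"Common Cold", "Cold Temperature"}},
]
print(build_wsd_instances("cold", senses, citations))
```

    The third citation is discarded because both candidate headings are indexed on it, so the indexing gives no unambiguous label for the term.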

    Annotation of protein residues based on a literature analysis: cross-validation against UniProtKb

    Background: A protein annotation database, such as the Universal Protein Resource knowledge base (UniProtKb), is a valuable resource for the validation and interpretation of predicted 3D structure patterns in proteins. Existing studies have focussed on point mutation extraction methods from biomedical literature, which can be used to support the time-consuming work of manual database curation. However, these methods are limited to point mutation extraction and do not extract features for the annotation of proteins at the residue level.
    Results: This work introduces a system that identifies protein residues in MEDLINE abstracts and annotates them with features extracted from the surrounding text. MEDLINE abstract texts have been processed to identify protein mentions in combination with taxonomic species and protein residues (F1-measure 0.52). The identified protein-species-residue triplets have been validated and benchmarked against reference data resources (UniProtKb, average F1-measure of 0.54). Then, contextual features were extracted through shallow and deep parsing and the features have been classified into predefined categories (F1-measure ranges from 0.15 to 0.67). Furthermore, the feature sets have been aligned with annotation types in UniProtKb to assess the relevance of the annotations for ongoing curation projects. Altogether, the annotations have been assessed automatically and manually against reference data resources.
    Conclusion: This work proposes a solution for the automatic extraction of functional annotation for protein residues from biomedical articles. The presented approach is an extension to other existing systems in that a wider range of residue entities are considered and that features of residues are extracted as annotations.

    Annotation systems for language corpora, parallel corpora, and comparable corpora [Systemy anotacji korpusów językowych, korpusów równoległych i porównywalnych].


    Morphosyntactic analysis of corpora (‘Part-of-speech tagging’) [Analiza morfologiczno-składniowa korpusów].


    Pathway Enrichment Based on Text Mining and Its Validation on Carotenoid and Vitamin A Metabolism

    Abstract: Carotenoid metabolism is relevant to the prevention of various diseases. Although the main actors in this metabolic pathway are known, our understanding of the pathway is still incomplete. The information on the carotenoids is scattered in the large and growing body of scientific literature. We designed a text-mining workflow to enrich existing pathways. It has been validated on the vitamin A pathway, which is a well-studied part of carotenoid metabolism. In this study we used the vitamin A metabolism pathway as it has been described by an expert team on carotenoid metabolism from the European network of excellence in Nutrigenomics (NuGO). This workflow uses an initial set of publications cited in a review paper (1,191 publications), enlarges this corpus with Medline abstracts (13,579 documents), and then extracts the key terminology from all relevant publications. Domain experts validated the intermediate and final results of our text-mining workflow. With our approach we were able to enrich the pathway representing vitamin A metabolism. From a total of 89,086 terms, we found 37 new and relevant terms that qualified for inclusion in the analyzed pathway. These 37 terms were assessed manually, and as a result 13 new terms were added as entities to the pathway. Another 14 entities belonged to other pathways and could form links between those pathways and the vitamin A pathway. The remaining 10 terms were classified as biomarkers or nutrients. Automatic literature analysis improves the enrichment of pathways with entities already described in the scientific literature.