
    Learning to extract relations for protein annotation

    Motivation: Protein annotation is a task that describes a protein X in terms of a topic Y, usually constructed from information in the biomedical literature. Until now, most literature-based protein annotation has been done manually by human annotators. However, as the number of biomedical papers grows ever more rapidly, manual annotation becomes more difficult, and there is an increasing need to automate the process. Recently, information extraction (IE) has been used to address this problem. Typically, IE requires pre-defined relations and hand-crafted IE rules or annotated corpora, and these requirements are difficult to satisfy in real-world scenarios such as the biomedical domain. In this article, we describe an IE system that requires only sentences labelled by domain experts as relevant or not to a given topic. Results: We applied our system to meet the annotation needs of a well-known protein family database; the results show that our IE system can annotate proteins with a set of extracted relations by learning relations and IE rules for disease, function and structure from only relevant and irrelevant sentences. Contact: [email protected]
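
    The abstract does not detail the learning step, but as a rough illustration of the kind of starting point such a system could use, the sketch below trains a sentence-relevance classifier from expert-labelled sentences and applies it to new text. The training sentences, labels and scikit-learn model choice are hypothetical assumptions, not the authors' actual pipeline.

# Illustrative sketch only: a minimal sentence-relevance classifier of the kind
# such an IE system could start from, assuming sentences labelled by domain
# experts as relevant/irrelevant to a topic (here "disease"). The training data
# and the scikit-learn model choice are hypothetical, not the paper's method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical expert-labelled training sentences (1 = relevant to "disease").
sentences = [
    "Mutations in BRCA1 are associated with hereditary breast cancer.",
    "The protein was expressed in E. coli and purified by chromatography.",
    "Loss of SMN1 function causes spinal muscular atrophy.",
    "Samples were centrifuged at 10,000 g for 15 minutes.",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(sentences, labels)

# Unlabelled candidate sentences; those classified relevant would feed the
# relation and rule learning step.
candidates = ["CFTR mutations lead to cystic fibrosis."]
print(model.predict(candidates))  # e.g. [1] if judged relevant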

    Evaluation and cross-comparison of lexical entities of biological interest (LexEBI)

    MOTIVATION: Biomedical entities, their identifiers and their names are essential in the representation of biomedical facts and knowledge. In the same way, the complete set of biomedical and chemical terms, i.e. the biomedical "term space" (the "Lexeome"), forms a key resource for achieving the full integration of the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. This goal not only requires that we are aware of all existing terms, but would also profit from knowing all their senses and their semantic interpretation (ambiguity, nestedness). RESULTS: This study compiles a resource of lexical terms of biomedical interest in a standard format (called "LexEBI"), and determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants, and chemical entities, amongst other terms. In addition, disease terms have been identified from Medline and PubMed Central and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, both the protein and gene entities and the chemical entities comprise enzymes, leading to hierarchical polysemy, and a large portion of PGNs refer to a chemical entity. Altogether, according to our analysis based on the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions. CONCLUSION: LexEBI delivers the complete biomedical and chemical Lexeome in a standardized representation (http://www.ebi.ac.uk/Rebholz-srv/LexEBI/). The resource provides the disease terms as open-source content and fully interlinks terms across resources.
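
    As a toy illustration of the term-space analysis described here (polysemy across semantic types and nestedness of terms), the sketch below runs both checks over a four-entry lexicon. The entries and the simplified substring notion of nestedness are assumptions for demonstration, not LexEBI's actual data model.

# Toy sketch of the term-space checks described above: polysemy (a baseform
# carrying more than one semantic type) and nestedness (a term contained in a
# longer term). The entries are hypothetical; LexEBI's real records also carry
# identifiers and term variants.
lexicon = {
    "p53": {"protein/gene"},
    "trypsin": {"protein/gene", "chemical"},    # an enzyme: hierarchical polysemy
    "insulin receptor": {"protein/gene"},
    "insulin": {"protein/gene", "chemical"},
}

polysemous = [term for term, types in lexicon.items() if len(types) > 1]

# Simplified nestedness: a term is nested if it occurs inside a longer term.
nested = [term for term in lexicon
          if any(term != other and term in other for other in lexicon)]

print("polysemous baseforms:", polysemous)  # ['trypsin', 'insulin']
print("nested terms:", nested)              # ['insulin']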

    Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources

    Motivation: The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: terminological and lexical resources deliver the term candidates to PGN tagging solutions, and gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally, all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard, and for this reason it is worth exploring how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource on their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs. RESULTS: In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon-based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions certainly perform best if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora and, on the other hand, the profiles of the false positive mistakes characterize the tagging solutions. LexTag solutions that are based on a large terminological resource in combination with false positive filtering produce better results, which, in addition, provide concept identifiers from a knowledge source, in contrast to ML-Tag solutions. CONCLUSION: The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards but cope differently with term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without significantly compromising their recall. The harmonisation of the annotation schemes, in combination with standardized lexical resources in the tagging solutions, will enable their comparability and pave the way for a shared standard.
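
    The comparison above relies on exact-match scoring against gold standard annotations; a minimal sketch of that metric, assuming tagger output and gold standard are both given as character-offset spans, is shown below. The offsets themselves are hypothetical.

# Sketch of exact-match evaluation, as used when scoring PGN taggers against a
# gold standard corpus: a predicted mention counts only if its span matches a
# gold span exactly. The character offsets below are hypothetical.
def exact_match_prf(gold_spans, predicted_spans):
    gold, pred = set(gold_spans), set(predicted_spans)
    tp = len(gold & pred)                            # exact span matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 4), (27, 32), (50, 58)]              # gold PGN offsets
pred = [(0, 4), (27, 33), (50, 58), (70, 75)]    # tagger output; one boundary error
print(exact_match_prf(gold, pred))               # approximately (0.5, 0.667, 0.571)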

    Exaggeration of wrinkles after botulinum toxin injection for forehead horizontal lines

    There have been no long-term complications or life-threatening adverse effects related to botulinum toxin treatment for any cosmetic indication. Nevertheless, there are well-known, mild side effects of botulinum toxin treatment on the upper face, though most of them are self-limited with time. However, excluding brow ptosis, reports about site-specific side effects are few and anecdotal. We experienced cases of exaggeration of wrinkles after botulinum toxin injection for forehead horizontal lines, and report them here. In our cases, the new appearance of a noticeable glabellar protrusion following botulinum toxin injection of the forehead was observed in 2 patients. In addition, a new deep wrinkle on one side of the forehead just above the eyebrow appeared in another 2 patients. The exaggerated wrinkles nearly disappeared without treatment by week 4 in all subjects. These exaggerations of wrinkles may be caused by hyperactivity and overcompensation of untreated muscles. With the increasing availability of diverse botulinum toxin products for cosmetic purposes, physicians and patients should be aware of this temporary change after therapeutic injections. We recommend explaining this possible effect prior to injection, for a better understanding of treatment for cosmetic indications.

    Database citation in full text biomedical articles.

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on an analysis of the full-text Open Access articles available from Europe PMC. Focusing on citations of entries in the European Nucleotide Archive (ENA), UniProt and the Protein Data Bank in Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also absent from the set of articles cited by database records. We recommend that structured annotation of database records in articles be extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high throughput of this text-mining pipeline make this activity both accurate and low-cost, which will allow the development of new integrated data services.
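
    As a rough illustration of how database record citations can be spotted in article text, the sketch below applies simplified regular expressions for ENA-, UniProt- and PDBe-style identifiers to an example sentence. The patterns and the example text are illustrative assumptions, not the Europe PMC pipeline's actual rules, and they deliberately expose the kind of ambiguity such a pipeline must resolve to reach high precision.

# Illustrative only: naive regular expressions for ENA/GenBank-, UniProt- and
# PDB-style identifiers. These simplified patterns are assumptions for
# demonstration, not the Europe PMC pipeline's rules, and they overlap: the
# UniProt accession below also matches the naive ENA pattern, which is exactly
# the kind of ambiguity a high-precision pipeline has to resolve.
import re

PATTERNS = {
    "ENA":     re.compile(r"\b[A-Z]{1,2}\d{5,6}\b"),      # e.g. X65923
    "UniProt": re.compile(r"\b[OPQ]\d[A-Z0-9]{3}\d\b"),   # e.g. P04637
    "PDBe":    re.compile(r"\b\d[A-Za-z0-9]{3}\b"),       # e.g. 1TUP
}

text = ("The structure 1TUP shows p53 (UniProt P04637) bound to DNA; "
        "the sequence was deposited under accession X65923.")

for database, pattern in PATTERNS.items():
    print(database, pattern.findall(text))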

    Corpus-Based Learning of Compound Noun Indexing

    We present a corpus-based learning method that can index diverse types of compound nouns using rules automatically extracted from a large tagged corpus. We develop an efficient way of extracting the compound noun indexing rules automatically and perform extensive experiments to evaluate them. The automatic learning method shows about the same performance as the manual linguistic approach but is more portable and requires no human effort. (IR&NLP workshop, July 12, 2000. Authors: Byung-Kwan Kwak, Jee-Hyub Kim and Geunbae Lee, NLP Lab., Dept. of CSE, Pohang University of Science & Technology (POSTECH), San 31, Hyoja-Dong, Pohang, 790-784, Korea; and Jung Yun Seo, NLP Lab., Dept. of Computer Science, Sogang University, Sinsu-dong 1, Mapo-gu, Seoul, Korea. Keywords: compound noun, indexing, corpus-based learning, automatic rule extraction, filtering.)
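
    As a toy illustration of corpus-based compound-noun indexing, the sketch below collects adjacent noun-noun pairs from a POS-tagged token sequence as candidate index terms and ranks them by frequency. The tagged sentence and the single noun-noun rule are hypothetical simplifications of the rule patterns the paper extracts automatically.

# Simplified sketch of corpus-based compound-noun indexing: collect adjacent
# noun-noun pairs from a POS-tagged corpus as candidate index terms and rank
# them by frequency. The tagged tokens and the single noun-noun rule are
# hypothetical stand-ins for the automatically extracted rules in the paper.
from collections import Counter

# (token, POS) pairs as they might come from a tagged corpus.
tagged = [
    ("information", "NN"), ("retrieval", "NN"), ("improves", "VBZ"),
    ("compound", "NN"), ("noun", "NN"), ("indexing", "NN"),
]

candidates = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if t1.startswith("NN") and t2.startswith("NN"):  # adjacent noun pair rule
        candidates[(w1, w2)] += 1

# High-frequency pairs would be kept as compound-noun index terms after filtering.
print(candidates.most_common())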