Porting a lexicalized-grammar parser to the biomedical domain

Clark, Stephen; Rimell, Laura

Publication date: 31 October 2009
Publisher: Elsevier Inc.

Abstract

This paper introduces a state-of-the-art, linguistically motivated statistical parser to the biomedical text mining community, and proposes a method of adapting it to the biomedical domain requiring only limited resources for data annotation. The parser was originally developed using the Penn Treebank and is therefore tuned to newspaper text. Our approach takes advantage of a lexicalized grammar formalism, Combinatory Categorial Grammar (CCG), to train the parser at a lower level of representation than full syntactic derivations. The CCG parser uses three levels of representation: a first level consisting of part-of-speech (POS) tags; a second level consisting of more fine-grained CCG lexical categories; and a third, hierarchical level consisting of CCG derivations. We find that simply retraining the POS tagger on biomedical data leads to a large improvement in parsing performance, and that using annotated data at the intermediate lexical category level of representation improves parsing accuracy further. We describe the procedure involved in evaluating the parser, and obtain accuracies for biomedical data in the same range as those reported for newspaper text, and higher than those previously reported for the biomedical resource on which we evaluate. Our conclusion is that porting newspaper parsers to the biomedical domain, at least for parsers which use lexicalized grammars, may not be as difficult as first thought.
