Performance of the Charniak-Lease parser on biological text using different training corpora

Alison V. Callahan; Michel Dumontier


Publication date: 1 January 2008

Abstract

Part-of-speech (POS) tagging is used as the first step in many NLP workflows, although the accuracy of tag assignment frequently goes unchecked. We hypothesize that changing the training corpora for a parser will affect its POS tagging of a target corpus. To test this, we train the Charniak-Lease parser on the WSJ corpus and two biomedical corpora and compare its output to that of MedPost, a POS tagger with a reported 97% accuracy on biomedical text. Our findings indicate that using biomedical training corpora significantly improves performance, but that minor differences between the biomedical training corpora have a significant effect on the correctness of POS tagging. Specifically, the tagging of hyphenated words and verbs was affected. This work suggests that the choice of training corpora is crucial to domain-targeted NLP analysis.
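
The comparison described in the abstract, scoring one tagger's output against another's, can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes that both outputs have already been tokenized identically and mapped to a common tag set; the function name `tag_agreement` and the toy sentence are invented for illustration.

```python
from collections import Counter


def tag_agreement(reference, candidate):
    """Compare two aligned lists of (token, tag) pairs.

    Returns the fraction of tokens with identical tags and a Counter of
    (reference_tag, candidate_tag) pairs for the tokens that disagree.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must be aligned token-for-token")
    matches = 0
    confusions = Counter()
    for (ref_tok, ref_tag), (cand_tok, cand_tag) in zip(reference, candidate):
        if ref_tok != cand_tok:
            raise ValueError(f"token mismatch: {ref_tok!r} vs {cand_tok!r}")
        if ref_tag == cand_tag:
            matches += 1
        else:
            confusions[(ref_tag, cand_tag)] += 1
    rate = matches / len(reference) if reference else 0.0
    return rate, confusions


# Toy data (hypothetical, not taken from the paper): the reference stands in
# for MedPost output, the candidate for the parser's tags.
reference = [("IL-2", "NN"), ("binds", "VBZ"), ("the", "DT"), ("receptor", "NN")]
candidate = [("IL-2", "JJ"), ("binds", "NNS"), ("the", "DT"), ("receptor", "NN")]

rate, confusions = tag_agreement(reference, candidate)
print(rate)                      # fraction of tokens with matching tags
print(confusions.most_common())  # most frequent disagreements first
```

Tallying disagreements by tag pair, rather than reporting a single accuracy figure, is what makes category-level effects visible, such as the differences on hyphenated words and verbs that the abstract reports.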

Available Versions

Nature Precedings

oai:nature.com:10101/npre.2008...

Last time updated on 17/02/2012

CiteSeerX

oai:CiteSeerX.psu:10.1.1.931.5...

Last time updated on 01/11/2017