
760 research outputs found

Performance of the Charniak-Lease parser on biological text using different training corpora

Author: Alison V. Callahan
Michel Dumontier
Publication date: 01/01/2008

POS tagging is used as the first step in many NLP workflows, although the accuracy of tag assignment frequently goes unchecked. We hypothesize that changing the training corpora for a parser will affect its POS tagging of a target corpus. To this end we train the Charniak-Lease parser on the WSJ corpus and two biomedical corpora and evaluate its output against that of MedPost, a POS tagger with a reported 97% accuracy on biomedical text. Our findings indicate that using biomedical training corpora significantly improves performance, but that minor differences in the biomedical training corpora have a significant effect on the correctness of POS tagging. Specifically, the tagging of hyphenated words and verbs was affected. This work suggests that the choice of training corpora is crucial to domain-targeted NLP analysis.
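The comparison described in this abstract, scoring one tagger's output against a reference tagger, can be illustrated with a minimal sketch. It assumes both outputs are already token-aligned and mapped to a common tag set; the function name `tag_agreement` and the sample tags are hypothetical and are not taken from the paper.

```python
from collections import Counter


def tag_agreement(parser_tags, reference_tags):
    """Return (agreement rate, Counter of (parser_tag, reference_tag) mismatches).

    Assumes the two sequences are token-aligned and use the same tag set.
    """
    if len(parser_tags) != len(reference_tags):
        raise ValueError("tag sequences must be token-aligned")
    if not parser_tags:
        raise ValueError("tag sequences must not be empty")
    # Count every position where the two taggers disagree, keyed by tag pair.
    mismatches = Counter(
        (p, r) for p, r in zip(parser_tags, reference_tags) if p != r
    )
    matches = len(parser_tags) - sum(mismatches.values())
    return matches / len(parser_tags), mismatches


# Hypothetical tags for a four-token sentence.
parser_out = ["NN", "VBZ", "JJ", "NN"]
reference = ["NN", "VBZ", "NN", "NN"]
rate, diffs = tag_agreement(parser_out, reference)
```

Keeping the mismatches keyed by tag pair makes it possible to see which categories drive the disagreement, in the spirit of the abstract's observation about hyphenated words and verbs.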

CiteSeerX

Nature Precedings

Improving Syntactic Parsing of Clinical Text Using Domain Knowledge

Author: Jiang Min
Publication venue: DigitalCommons@TMC
Publication date: 01/01/2017

Syntactic parsing is one of the fundamental tasks of Natural Language Processing (NLP). However, few studies have explored syntactic parsing in the medical domain. This dissertation systematically investigated different methods to improve the performance of syntactic parsing of clinical text, including (1) constructing two clinical treebanks of discharge summaries and progress notes by developing annotation guidelines that handle missing elements in clinical sentences; (2) retraining four state-of-the-art parsers, including the Stanford parser, Berkeley parser, Charniak parser, and Bikel parser, using clinical treebanks, and comparing their performance to identify better parsing approaches; and (3) developing new methods to reduce syntactic ambiguity caused by Prepositional Phrase (PP) attachment and coordination using semantic information. Our evaluation showed that clinical treebanks greatly improved the performance of existing parsers. The Berkeley parser achieved the best F-1 score of 86.39% on the MiPACQ treebank. For PP attachment, our proposed methods improved the accuracies of PP attachment by 2.35% on the MiPACQ corpus and 1.77% on the i2b2 corpus. For coordination, our method achieved a precision of 94.9% and 90.3% for the MiPACQ and i2b2 corpora, respectively. To further demonstrate the effectiveness of the improved parsing approaches, we applied outputs of our parsers to two external NLP tasks: semantic role labeling and temporal relation extraction. The experimental results showed that performance on both tasks was improved by using the parse tree information from our optimized parsers, with an improvement of 3.26% in F-measure for semantic role labeling and an improvement of 1.5% in F-measure for temporal relation extraction.

DigitalCommons@The Texas Medical Center

Evaluating contributions of natural language parsers to protein–protein interaction extraction

Author: Matsuzaki Takuya
Miyao Yusuke
Sagae Kenji
Sætre Rune
Tsujii Jun'ichi
Publication venue: Oxford University Press
Publication date: 01/02/2009

Motivation: While text mining technologies for biomedical research have gained popularity as a way to take advantage of the explosive growth of information in text form in biomedical papers, selecting appropriate natural language processing (NLP) tools is still difficult for researchers who are not familiar with recent advances in NLP. This article provides a comparative evaluation of several state-of-the-art natural language parsers, focusing on the task of extracting protein–protein interactions (PPI) from biomedical papers. We measure how each parser, and its output representation, contributes to accuracy improvement when the parser is used as a component in a PPI system.

PubMed Central

eScholarship - University of California

Benchmarking natural-language parsers for biological applications using dependency graphs

Author: A Bies
AB Clegg
Adrian J Shepherd
Andrew B Clegg
B Rosario
B Srinivas
C Friedman
C Grover
D Blaheta
D Gildea
D Klein
D Lin
D Sleator
DM Bikel
E Charniak
E Tsivtsivadze
EB Camon
EJ Briscoe
G Sampson
G Schneider
IM Goldin
J Carroll
J Finkel
J Xiao
JM Temkin
K Franzén
K Knight
KB Cohen
L Smith
M Collins
M Lease
MC de Marneffe
MP Marcus
N Domedel-Puig
N Ge
O Sanchez
P Merlo
PG Mutalik
S Abney
S Kübler
S Pyysalo
ST Ahmed
T Briscoe
TC Rindflesch
Y Huang
Z Shi
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. We present a method for evaluating their accuracy using an intermediate representation based on dependency graphs, in which the semantic relationships important in most information extraction tasks are closer to the surface. We also demonstrate how this method can be easily tailored to various application-driven criteria. RESULTS: Using the GENIA corpus as a gold standard, we tested four open-source parsers which have been used in bioinformatics projects. We first present overall performance measures, and then test the two leading tools, the Charniak-Lease and Bikel parsers, on subtasks tailored to reflect the requirements of a system for extracting gene expression relationships. These two tools clearly outperform the other parsers in the evaluation, and achieve accuracy levels comparable to or exceeding native dependency parsers on similar tasks in previous biological evaluations. CONCLUSION: Evaluating using dependency graphs allows parsers to be tested easily on criteria chosen according to the semantics of particular biological applications, drawing attention to important mistakes and soaking up many insignificant differences that would otherwise be reported as errors. Generating high-accuracy dependency graphs from the output of phrase-structure parsers also provides access to the more detailed syntax trees that are used in several natural-language processing techniques.
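As an illustration of dependency-based scoring in general, and not the authors' actual procedure, the sketch below compares predicted and gold dependency triples and optionally restricts the comparison to a chosen set of relation types, which is one simple way to tailor an evaluation to an application. The function name `dependency_prf`, the relation labels, and the example sentence are assumptions; a real evaluation would identify tokens by position rather than by word form.

```python
def dependency_prf(predicted, gold, relations=None):
    """Precision, recall and F1 over (head, relation, dependent) triples.

    If `relations` is given, only triples with those relation labels are scored.
    """
    predicted, gold = set(predicted), set(gold)
    if relations is not None:
        predicted = {t for t in predicted if t[1] in relations}
        gold = {t for t in gold if t[1] in relations}
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical triples for the sentence "IL-2 activates the NF-kB".
gold_deps = {
    ("activates", "nsubj", "IL-2"),
    ("activates", "dobj", "NF-kB"),
    ("NF-kB", "det", "the"),
}
pred_deps = {
    ("activates", "nsubj", "IL-2"),
    ("activates", "dobj", "NF-kB"),
    ("IL-2", "det", "the"),  # determiner attached to the wrong noun
}

overall = dependency_prf(pred_deps, gold_deps)
core_only = dependency_prf(pred_deps, gold_deps, relations={"nsubj", "dobj"})
```

In this toy case the misattached determiner lowers the overall score but has no effect once scoring is limited to subject and object relations, which mirrors the abstract's point about separating important mistakes from insignificant differences.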

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Adapting a general parser to a sublanguage

Author: Aubin Sophie
Nazarenko Adeline
Nédellec Claire
Publication date: 01/01/2005

In this paper, we propose a method to adapt a general parser (Link Parser) to sublanguages, focusing on the parsing of texts in biology. Our main proposal is the use of terminology (identification and analysis of terms) in order to reduce the complexity of the text to be parsed. Several other strategies are explored and finally combined, among them text normalization, lexicon and morpho-guessing module extensions, and grammar rule adaptation. We compare the parsing results before and after these adaptations.

arXiv.org e-Print Archive

HAL Descartes

HAL-Paris 13

Parsing clinical text: how good are the state-of-the-art parsers?

Publication venue: BioMed Central
Publication date: 20/05/2015

Springer - Publisher Connector

Deep learning for extracting protein-protein interactions from biomedical literature

Author: Lu Zhiyong
Peng Yifan
Publication date: 01/01/2017

State-of-the-art methods for protein-protein interaction (PPI) extraction are primarily feature-based or kernel-based, leveraging lexical and syntactic information. But how to incorporate such knowledge into recent deep learning methods remains an open question. In this paper, we propose a multichannel dependency-based convolutional neural network model (McDepCNN). It applies one channel to the embedding vector of each word in the sentence, and another channel to the embedding vector of the head of the corresponding word. Therefore, the model can use richer information obtained from different channels. Experiments on two public benchmarking datasets, AIMed and BioInfer, demonstrate that McDepCNN compares favorably to the state-of-the-art rich-feature and single-kernel based methods. In addition, McDepCNN achieves 24.4% relative improvement in F1-score over the state-of-the-art methods on cross-corpus evaluation and 12% improvement in F1-score over kernel-based methods on "difficult" instances. These results suggest that McDepCNN generalizes more easily over different corpora, and is capable of capturing long distance features in the sentences. Comment: Accepted for publication in Proceedings of the 2017 Workshop on Biomedical Natural Language Processing, 10 pages, 2 figures, 6 tables
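The two-channel input described in this abstract can be pictured with a small sketch that builds, for each token, the embedding of the word itself and the embedding of its dependency head. This covers input construction only, not the convolutional model. The function name `build_channels`, the toy embeddings, and the use of a zero vector for the root and for unknown words are assumptions, not details from the paper.

```python
def build_channels(tokens, heads, embed):
    """Build word and head embedding channels for one sentence.

    tokens: list of words.
    heads:  index of each word's head in `tokens`, or -1 for the root.
    embed:  dict mapping a word to its embedding vector (list of floats).
    """
    if len(tokens) != len(heads):
        raise ValueError("tokens and heads must have the same length")
    dim = len(next(iter(embed.values())))
    zero = [0.0] * dim  # assumed placeholder for the root and unknown words
    word_channel = [embed.get(t, zero) for t in tokens]
    head_channel = [
        embed.get(tokens[h], zero) if h >= 0 else zero for h in heads
    ]
    return word_channel, head_channel


# Hypothetical two-dimensional embeddings and parse for "PROT1 binds PROT2".
toy_embed = {
    "PROT1": [0.1, 0.2],
    "binds": [0.5, 0.6],
    "PROT2": [0.3, 0.4],
}
tokens = ["PROT1", "binds", "PROT2"]
heads = [1, -1, 1]  # both proteins depend on the verb; the verb is the root
word_channel, head_channel = build_channels(tokens, heads, toy_embed)
```

Each position then carries two vectors of the same dimension, so a convolution over the sentence can see a word and its syntactic head side by side.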

arXiv.org e-Print Archive

Crossref