Introduction to the CoNLL-2000 Shared Task: Chunking
We describe the CoNLL-2000 shared task: dividing text into syntactically
related non-overlapping groups of words, so-called text chunking. We give
background information on the data sets, present a general overview of the
systems that have taken part in the shared task and briefly discuss their
performance.
Comment: 6 pages
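Chunking in the CoNLL-2000 style is conventionally encoded with per-token BIO labels (B- opens a chunk, I- continues it, O is outside any chunk). A minimal sketch, not taken from the shared task software itself, of decoding such a label sequence into the non-overlapping chunk spans the abstract describes:

```python
def bio_to_chunks(tags):
    """Decode a BIO tag sequence into (type, start, end) chunk spans.

    Chunks are non-overlapping, contiguous token groups, as in the
    CoNLL-2000 representation; the end index is exclusive.
    """
    chunks = []
    start, ctype = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and ctype != tag[2:]):
            # A B- tag, or an I- tag of a new type, opens a fresh chunk.
            if ctype is not None:
                chunks.append((ctype, start, i))
            start, ctype = i, tag[2:]
        elif tag == "O":
            # O closes any open chunk.
            if ctype is not None:
                chunks.append((ctype, start, i))
            start, ctype = None, None
    if ctype is not None:
        chunks.append((ctype, start, len(tags)))
    return chunks
```

For example, `bio_to_chunks(["B-NP", "I-NP", "O", "B-VP", "B-NP"])` yields `[("NP", 0, 2), ("VP", 3, 4), ("NP", 4, 5)]`; a B- tag immediately after a chunk of the same type starts a new chunk rather than extending the old one.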
Building trainable taggers in a web-based, UIMA-supported NLP workbench
Argo is a web-based NLP and text mining workbench with a convenient graphical user interface for designing and executing processing workflows of various complexity. The workbench is intended for specialists and non-technical audiences alike, and provides an ever-expanding library of analytics compliant with the Unstructured Information Management Architecture, a widely adopted interoperability framework. We explore the flexibility of this framework by demonstrating workflows involving three processing components capable of performing self-contained machine learning-based tagging. The three components are responsible for three distinct tasks: 1) generating observations or features, 2) training a statistical model based on the generated features, and 3) tagging unlabelled data with the model. The learning and tagging components are based on an implementation of conditional random fields (CRF), whereas the feature generation component is an analytic capable of extending basic token information to a comprehensive set of features. Users define the features of their choice directly from Argo's graphical interface, without resorting to programming (a commonly used approach to feature engineering). Experimental results on two tagging tasks, chunking and named entity recognition, showed that a tagger with a generic set of features built in Argo is capable of competing with task-specific solutions.
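The feature-generation step described above can be sketched as a function mapping each token position to a feature dictionary, the usual input representation for CRF taggers. This is a generic illustration, not Argo's actual analytic; the particular feature names and window size are assumptions:

```python
def token_features(tokens, i, window=2):
    """Extend basic token information into a feature dict for position i.

    Generic CRF-style feature generation: surface form, shape features,
    a suffix, and neighbouring tokens within a context window.
    """
    tok = tokens[i]
    feats = {
        "word": tok.lower(),
        "is_upper": tok.isupper(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "suffix3": tok[-3:],
    }
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        # Pad with a sentinel when the window extends past the sentence.
        feats[f"word[{off:+d}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
    return feats
```

One dictionary per token, for every token in a sentence, is exactly the shape a CRF learner consumes at training time and a CRF tagger consumes when labelling unseen data.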
Memory-Based Shallow Parsing
We present a memory-based learning (MBL) approach to shallow parsing in which
POS tagging, chunking, and identification of syntactic relations are formulated
as memory-based modules. The experiments reported in this paper show
competitive results: the F-values on the Wall Street Journal (WSJ) treebank are
93.8% for NP chunking, 94.7% for VP chunking, 77.1% for subject detection, and
79.0% for object detection.
Comment: 8 pages, to appear in: Proceedings of the EACL'99 workshop on
Computational Natural Language Learning (CoNLL-99), Bergen, Norway, June 1999
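F-values like those reported above are conventionally computed at the chunk level: a predicted chunk counts as correct only when it matches a gold chunk exactly in type and span. A minimal sketch of that computation (not the paper's own evaluation code):

```python
def chunk_f1(gold, pred):
    """Chunk-level F1 over (type, start, end) spans.

    Precision = exact matches / predicted chunks,
    recall    = exact matches / gold chunks,
    F1        = their harmonic mean.
    """
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

For instance, with gold chunks `[("NP", 0, 2), ("VP", 2, 3)]` and predictions `[("NP", 0, 2), ("VP", 2, 4)]`, only the NP matches exactly, so precision and recall are both 0.5 and F1 is 0.5; the partially overlapping VP earns no credit.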
Alignment-guided chunking
We introduce an adaptable monolingual chunking approach, Alignment-Guided
Chunking (AGC), which makes use of knowledge of word alignments acquired from
bilingual corpora. Our approach is motivated by the observation that a sentence
should be chunked differently depending on the foreseen end-tasks. For example,
given the different requirements of translation into (say) French and German,
it is inappropriate to chunk up an English string in exactly the same way as
preparation for translation into one or other of these languages. We test our
chunking approach on two language pairs, French-English and German-English,
where the two bilingual corpora share the same English sentences. Two chunkers
trained on French-English (FE-Chunker) and German-English (DE-Chunker)
respectively are used to perform chunking on the same English sentences. We
construct two test sets, suitable for French-English and German-English
respectively. The performance of the two chunkers is evaluated on the
appropriate test set; with only one reference translation, we report F-scores
of 32.63% for the FE-Chunker and 40.41% for the DE-Chunker.
Chunking clinical text containing non-canonical language
Free text notes typed by primary care physicians during patient consultations typically contain highly non-canonical language. Shallow syntactic analysis of free text notes can help to reveal valuable information for the study of disease and treatment. We present an exploratory study into chunking such text using off-the-shelf language processing tools and pre-trained statistical models. We evaluate chunking accuracy with respect to part-of-speech tagging quality, choice of chunk representation, and breadth of context features. Our results indicate that narrow context feature windows give the best results, but that chunk representation and minor differences in tagging quality do not have a significant impact on chunking accuracy.