Memory-Based Shallow Parsing
We present memory-based learning approaches to shallow parsing and apply
these to five tasks: base noun phrase identification, arbitrary base phrase
recognition, clause detection, noun phrase parsing and full parsing. We use
feature selection techniques and system combination methods for improving the
performance of the memory-based learner. Our approach is evaluated on standard
data sets and the results are compared with those of other systems. This
reveals that our approach works well for base phrase identification, while its
application to recognizing embedded structures leaves some room for
improvement.
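Memory-based learning of this kind is essentially k-nearest-neighbour classification over windowed word and part-of-speech features. The following minimal sketch illustrates that setup with scikit-learn stand-ins; the window size, feature names and toy sentences are illustrative assumptions, not the paper's exact configuration.

    # k-NN base-NP chunking over a word/POS window (illustrative only).
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    def window_features(words, tags, i):
        """Word and POS features in a two-token window around position i."""
        feats = {}
        for d in (-2, -1, 0, 1, 2):
            j = i + d
            if 0 <= j < len(words):
                feats[f"w{d}"] = words[j]
                feats[f"t{d}"] = tags[j]
        return feats

    # Toy training data: tokens, POS tags, IOB chunk labels.
    train = [(["He", "reckons", "the", "current", "deficit"],
              ["PRP", "VBZ", "DT", "JJ", "NN"],
              ["B-NP", "O", "B-NP", "I-NP", "I-NP"])]

    X, y = [], []
    for words, tags, labels in train:
        for i, label in enumerate(labels):
            X.append(window_features(words, tags, i))
            y.append(label)

    chunker = make_pipeline(DictVectorizer(), KNeighborsClassifier(n_neighbors=1))
    chunker.fit(X, y)

    words, tags = ["She", "sees", "a", "big", "picture"], ["PRP", "VBZ", "DT", "JJ", "NN"]
    print(chunker.predict([window_features(words, tags, i) for i in range(len(words))]))

In the paper the memory-based learner is further tuned with feature selection and combined with other systems, as the abstract notes; the sketch shows only the core instance-based classification step.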
Learning Recursive Segments for Discourse Parsing
Automatically detecting discourse segments is an important preliminary step
towards full discourse parsing. Previous research on discourse segmentation
has relied on the assumption that elementary discourse units (EDUs) in a
document always form a linear sequence (i.e., they can never be nested).
Unfortunately, this assumption turns out to be too strong, as some theories of
discourse, such as SDRT, allow for nested discourse units. In this paper, we
present a simple approach to discourse segmentation that is able to produce
nested EDUs. Our approach builds on standard multi-class classification
techniques combined with a simple repairing heuristic that enforces global
coherence. Our system was developed and evaluated on the first round of
annotations provided by the French Annodis project (an ongoing effort to create
a discourse bank for French). Cross-validated on only 47 documents (1,445
EDUs), our system achieves encouraging performance results with an F-score of
73% for finding EDUs. (Published at LREC.)
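The repair heuristic mentioned above can be made concrete with a small sketch: per-token bracket decisions from the multi-class classifier are post-processed so that the predicted EDU brackets are globally well-nested. The bracket encoding and the specific repair choices below (drop unmatched closing brackets, close any still-open EDUs at the end of the document) are illustrative assumptions, not necessarily the rules of the actual system.

    def repair(brackets):
        """brackets[i] is the list of EDU brackets predicted at token i,
        e.g. ["(", ")"] for a one-token embedded EDU. Unmatched ")" are
        dropped and EDUs still open at the end are closed on the last token."""
        depth = 0
        repaired = []
        for token_brackets in brackets:
            kept = []
            for b in token_brackets:
                if b == "(":
                    depth += 1
                    kept.append(b)
                elif b == ")" and depth > 0:   # keep a close only if an EDU is open
                    depth -= 1
                    kept.append(b)
            repaired.append(kept)
        repaired[-1].extend([")"] * depth)     # force-close EDUs left open
        return repaired

    # The unmatched ")" predicted at the fourth token is dropped.
    print(repair([["("], ["(", ")"], [")"], [")"], []]))
    # [['('], ['(', ')'], [')'], [], []]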
Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition
We describe the CoNLL-2002 shared task: language-independent named entity
recognition. We give background information on the data sets and the evaluation
method, present a general overview of the systems that have taken part in the
task and discuss their performance.
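Scoring in the CoNLL shared tasks is at the entity level: a named entity counts as correct only when both its type and its exact span match the annotation, and systems are compared by the F(beta=1) score over all entities. The snippet below is a simplified sketch of that computation for IOB2-tagged sequences; the shared task used its own evaluation script, so this is only an illustration.

    def spans(tags):
        """Return {(start, end, type)} for all entities in one IOB2 tag sequence."""
        ents, start, etype = set(), None, None
        for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last entity
            if tag.startswith("B-") or tag == "O" or (start is not None and tag[2:] != etype):
                if start is not None:
                    ents.add((start, i, etype))
                    start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        return ents

    def entity_f1(gold_tags, pred_tags):
        gold, pred = spans(gold_tags), spans(pred_tags)
        correct = len(gold & pred)
        p = correct / len(pred) if pred else 0.0
        r = correct / len(gold) if gold else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    gold = ["B-PER", "I-PER", "O", "B-LOC"]
    pred = ["B-PER", "I-PER", "O", "B-ORG"]
    print(entity_f1(gold, pred))   # 0.5: the person is found, the location is not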
A memory-based classification approach to marker-based EBMT
We describe a novel approach to example-based machine translation that makes use of marker-based chunks, in which the decoder is a memory-based classifier. The classifier is trained to map trigrams of source-language chunks onto trigrams of target-language chunks; then, in a second decoding step, the predicted trigrams are rearranged according to their overlap. We present the first results of this method on a Dutch-to-English translation system using Europarl data. Sparseness of the class space causes the results to lag behind a baseline phrase-based SMT system. In a further comparison, we also apply the method to a word-aligned version of the same data, and report a smaller difference with a word-based SMT system. We explore the scaling abilities of the memory-based approach, and observe linear scaling behavior in training and classification speed and memory costs, and log-linear BLEU improvements with the amount of training examples.
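The decoding pipeline sketched in the abstract has two steps: a memory-based classifier maps each source-chunk trigram to a target-chunk trigram, and the predicted trigrams are then glued together where they overlap. The toy sketch below illustrates the idea; the chunked sentences, the exact-match "memory" standing in for the classifier, and the greedy overlap gluing are all assumptions for illustration.

    # Step 1: trigram memory (stand-in for the memory-based classifier).
    memory = {
        ("[de kat]", "[zat]", "[op de mat]"): ("[the cat]", "[sat]", "[on the mat]"),
        ("[zat]", "[op de mat]", "</s>"):     ("[sat]", "[on the mat]", "</s>"),
    }

    def predict(source_trigram):
        """Exact lookup with an identity fallback for unseen trigrams."""
        return memory.get(source_trigram, source_trigram)

    # Step 2: assemble one chunk sequence from overlapping target trigrams.
    def glue(trigrams):
        out = list(trigrams[0])
        for tri in trigrams[1:]:
            overlap = 0
            for k in (2, 1):                  # longest suffix of out == prefix of tri
                if tuple(out[-k:]) == tuple(tri[:k]):
                    overlap = k
                    break
            out.extend(tri[overlap:])
        return out

    source = [("[de kat]", "[zat]", "[op de mat]"), ("[zat]", "[op de mat]", "</s>")]
    print(glue([predict(tri) for tri in source]))
    # ['[the cat]', '[sat]', '[on the mat]', '</s>']

Even this toy makes the class-space sparseness visible: a source trigram that never occurred in training has nothing useful to fall back on.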
Meta-Learning for Phonemic Annotation of Corpora
We apply rule induction, classifier combination and meta-learning (stacked
classifiers) to the problem of bootstrapping high accuracy automatic annotation
of corpora with pronunciation information. The task we address in this paper
consists of generating phonemic representations reflecting the Flemish and
Dutch pronunciations of a word on the basis of its orthographic representation
(which in turn is based on the actual speech recordings). We compare several
possible approaches to achieve the text-to-pronunciation mapping task:
memory-based learning, transformation-based learning, rule induction, maximum
entropy modeling, combination of classifiers in stacked learning, and stacking
of meta-learners. We are interested both in optimal accuracy and in obtaining
insight into the linguistic regularities involved. As far as accuracy is
concerned, an already high accuracy level (93% for Celex and 86% for Fonilex at
word level) for single classifiers is boosted significantly with additional
error reductions of 31% and 38% respectively using combination of classifiers,
and a further 5% using combination of meta-learners, bringing overall word
level accuracy to 96% for the Dutch variant and 92% for the Flemish variant. We
also show that the application of machine learning methods indeed leads to
increased insight into the linguistic regularities determining the variation
between the two pronunciation variants studied.
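The error reductions quoted here are relative: a 31% reduction takes the 7% word-level error of the 93%-accurate single Celex classifier down to roughly 4.8%. The stacking itself can be sketched compactly: several base classifiers predict a phoneme for each letter from a small letter window, and a meta-learner is trained on their outputs. The sketch below uses generic scikit-learn learners as stand-ins for the memory-based, rule-induction and maximum-entropy systems of the paper, with a two-word toy data set; none of it reflects the actual Celex/Fonilex setup.

    from sklearn.ensemble import StackingClassifier
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.neighbors import KNeighborsClassifier

    def letter_windows(word):
        """One feature dict per letter: the letter and its left/right neighbours."""
        padded = "_" + word + "_"
        return [{"l-1": padded[i - 1], "l0": padded[i], "l+1": padded[i + 1]}
                for i in range(1, len(padded) - 1)]

    # Toy letter-aligned orthography -> phoneme data ("-" marks the silent half
    # of the "oo" digraph; "bood" ends in /t/ through final devoicing).
    words = ["boot", "bood"]
    phonemes = [["b", "o:", "-", "t"], ["b", "o:", "-", "t"]]

    vec = DictVectorizer()
    X = vec.fit_transform([w for word in words for w in letter_windows(word)])
    y = [p for seq in phonemes for p in seq]

    stack = StackingClassifier(
        estimators=[("knn", KNeighborsClassifier(n_neighbors=1)),
                    ("nb", BernoulliNB())],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=2,
    )
    stack.fit(X, y)
    print(stack.predict(vec.transform(letter_windows("boot"))))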
Improving Neural Sequence Labelling Using Additional Linguistic Information
Sequence labelling is the task of mapping sequential data from one domain to another. Since language can be interpreted as a sequence of words, sequence labelling is very common in the field of Natural Language Processing (NLP). Some fundamental sequence labelling tasks in NLP are Part-of-Speech tagging, Named Entity Recognition, and chunking. Moreover, many NLP tasks can be modeled as sequence labelling or sequence-to-sequence labelling, such as machine translation, information retrieval, and question answering. An extensive amount of research has already been performed on sequence labelling, and most of the current high-performing models are neural networks. These deep-learning-based models outperform traditional machine learning techniques by using abstract, high-dimensional feature representations of the input data. In this thesis, we propose a new neural sequence model that uses several additional types of linguistic information to improve performance. The convergence rate of the proposed model is significantly lower than that of similar models. Moreover, our model obtains state-of-the-art results on the benchmark datasets for POS tagging, NER, and chunking.
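The general idea of adding linguistic information to a neural sequence labeller can be sketched as follows: embeddings of an additional feature (here a made-up capitalisation class) are concatenated with the word embeddings before a bidirectional LSTM produces per-token tag scores. The architecture, dimensions and feature below are illustrative assumptions, not the exact model proposed in the thesis.

    import torch
    import torch.nn as nn

    class AugmentedTagger(nn.Module):
        def __init__(self, vocab_size, feat_size, num_tags,
                     word_dim=100, feat_dim=10, hidden_dim=128):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            self.feat_emb = nn.Embedding(feat_size, feat_dim)   # extra linguistic feature
            self.lstm = nn.LSTM(word_dim + feat_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden_dim, num_tags)

        def forward(self, word_ids, feat_ids):
            x = torch.cat([self.word_emb(word_ids), self.feat_emb(feat_ids)], dim=-1)
            h, _ = self.lstm(x)
            return self.out(h)                   # (batch, seq_len, num_tags) tag scores

    # Toy usage: one sentence of 5 tokens, 3 capitalisation classes, 9 tags.
    model = AugmentedTagger(vocab_size=1000, feat_size=3, num_tags=9)
    words = torch.randint(0, 1000, (1, 5))
    caps = torch.randint(0, 3, (1, 5))
    print(model(words, caps).shape)              # torch.Size([1, 5, 9])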
Overview of BioCreative II gene mention recognition.
Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task participants designed systems to identify substrings in sentences corresponding to gene name mentions. A variety of different methods were used, and the results varied, with the highest achieved F1 score being 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F1 score of 0.9066 is feasible, and furthermore that the best combined result makes use of the lowest-scoring submissions.
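One simple way to realise such a combination is span-level voting: keep a gene-mention span if enough of the submitted systems predicted it. The sketch below illustrates this idea; it is an assumption for illustration and not necessarily the combination procedure analysed in the paper.

    from collections import Counter

    def combine(system_predictions, min_votes):
        """system_predictions: one set of (sentence_id, start, end) spans per system.
        Returns the spans predicted by at least min_votes systems."""
        votes = Counter(span for spans in system_predictions for span in spans)
        return {span for span, count in votes.items() if count >= min_votes}

    sys_a = {("s1", 0, 5), ("s1", 10, 18)}
    sys_b = {("s1", 0, 5)}
    sys_c = {("s1", 0, 5), ("s2", 3, 9)}
    print(combine([sys_a, sys_b, sys_c], min_votes=2))   # {('s1', 0, 5)}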