273 research outputs found

Memory-Based Shallow Parsing

Author: Tjong Kim Sang Erik F.
Publication date: 01/01/2002

We present memory-based learning approaches to shallow parsing and apply these to five tasks: base noun phrase identification, arbitrary base phrase recognition, clause detection, noun phrase parsing and full parsing. We use feature selection techniques and system combination methods for improving the performance of the memory-based learner. Our approach is evaluated on standard data sets and the results are compared with those of other systems. This reveals that our approach works well for base phrase identification, while its application to recognizing embedded structures leaves some room for improvement.

arXiv.org e-Print Archive

CiteSeerX

Institutional Repository Universiteit Antwerpen

Tilburg University Repository

Learning Recursive Segments for Discourse Parsing

Author: Afantenos Stergos
Danlos Laurence
Denis Pascal
Muller Philippe
Publication date: 28/03/2010

Automatically detecting discourse segments is an important preliminary step towards full discourse parsing. Previous research on discourse segmentation has relied on the assumption that elementary discourse units (EDUs) in a document always form a linear sequence (i.e., they can never be nested). Unfortunately, this assumption turns out to be too strong, since some theories of discourse, such as SDRT, allow for nested discourse units. In this paper, we present a simple approach to discourse segmentation that is able to produce nested EDUs. Our approach builds on standard multi-class classification techniques combined with a simple repairing heuristic that enforces global coherence. Our system was developed and evaluated on the first round of annotations provided by the French Annodis project (an ongoing effort to create a discourse bank for French). Cross-validated on only 47 documents (1,445 EDUs), our system achieves encouraging performance results with an F-score of 73% for finding EDUs. Comment: published at LREC 201

arXiv.org e-Print Archive

Scientific Publications of the University of Toulouse II Le Mirail

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition

Author: Tjong Kim Sang Erik F.
Publication date: 01/01/2002

We describe the CoNLL-2002 shared task: language-independent named entity recognition. We give background information on the data sets and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance. Comment: 4 pages

arXiv.org e-Print Archive

Institutional Repository Universiteit Antwerpen

Tilburg University Repository

A memory-based classification approach to marker-based EBMT

Author: Stroppa Nicolas
van den Bosch Antal
Way Andy
Publication date: 01/01/2007

We describe a novel approach to example-based machine translation that makes use of marker-based chunks, in which the decoder is a memory-based classifier. The classifier is trained to map trigrams of source-language chunks onto trigrams of target-language chunks; then, in a second decoding step, the predicted trigrams are rearranged according to their overlap. We present the first results of this method on a Dutch-to-English translation system using Europarl data. Sparseness of the class space causes the results to lag behind a baseline phrase-based SMT system. In a further comparison, we also apply the method to a word-aligned version of the same data, and report a smaller difference with a word-based SMT system. We explore the scaling abilities of the memory-based approach, and observe linear scaling behavior in training and classification speed and memory costs, and loglinear BLEU improvements in the amount of training examples.

Irish Universities

DCU Online Research Access Service

Meta-Learning for Phonemic Annotation of Corpora

Author: Daelemans W.
Gillis S.
Hoste V.
Tjong Kim Sang E.F.
van den Bosch A.
Weigand H.
Publication date: 01/01/2000

We apply rule induction, classifier combination and meta-learning (stacked classifiers) to the problem of bootstrapping high accuracy automatic annotation of corpora with pronunciation information. The task we address in this paper consists of generating phonemic representations reflecting the Flemish and Dutch pronunciations of a word on the basis of its orthographic representation (which in turn is based on the actual speech recordings). We compare several possible approaches to achieve the text-to-pronunciation mapping task: memory-based learning, transformation-based learning, rule induction, maximum entropy modeling, combination of classifiers in stacked learning, and stacking of meta-learners. We are interested both in optimal accuracy and in obtaining insight into the linguistic regularities involved. As far as accuracy is concerned, an already high accuracy level (93% for Celex and 86% for Fonilex at word level) for single classifiers is boosted significantly with additional error reductions of 31% and 38% respectively using combination of classifiers, and a further 5% using combination of meta-learners, bringing overall word level accuracy to 96% for the Dutch variant and 92% for the Flemish variant. We also show that the application of machine learning methods indeed leads to increased insight into the linguistic regularities determining the variation between the two pronunciation variants studied. Comment: 8 pages

arXiv.org e-Print Archive

CiteSeerX

Ghent University Academic Bibliography

Institutional Repository Universiteit Antwerpen

Tilburg University Repository

Improving Neural Sequence Labelling Using Additional Linguistic Information

Author: Samee Muhammad Rifayat
Publication venue: Scholarship@Western
Publication date: 23/04/2019

Sequence labelling is the task of mapping sequential data from one domain to another domain. As we can interpret language as a sequence of words, sequence labelling is very common in the field of Natural Language Processing (NLP). In NLP, some fundamental sequence labelling tasks are Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), and Chunking. Moreover, many NLP tasks, such as machine translation, information retrieval and question answering, can be modeled as sequence labelling or sequence-to-sequence labelling. An extensive amount of research has already been performed on sequence labelling. Most of the current high-performing models are neural network models. These deep-learning-based models outperform traditional machine learning techniques by using abstract high-dimensional feature representations of the input data. In this thesis, we propose a new neural sequence model which uses several additional types of linguistic information to improve the model performance. The convergence rate of the proposed model is significantly lower than that of similar models. Moreover, our model obtains state-of-the-art results on the benchmark datasets of POS, NER, and chunking.

Scholarship@Western

Overview of BioCreative II gene mention recognition

Author: Adriaans P.
Baumgartner (jr.) W.A.
Blaschke C.
Carpenter B.
Chen Y.
Chung I-F.
Dai H.-J.
Divoli A.
Friedrich C.M.
Ganchev K.
Haddow B.
Hsu C.-N.
Hunter L.
Johnson R.
Katrenko S.
Klinger R.
Kuo C.-J.
Lin Y.-S.
Liu F.
Liu H.
Mata J.
Maña-López M.
Nakov P.
Neves M.
Povinelli R.J.
Smith L.
Struble C.A.
Sun C.
Tanabe L.K.
Torii M.
Torres R.
Tsai R.T.-H.
Vlachos A.
Wilbur W.J.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008

Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task participants designed systems to identify substrings in sentences corresponding to gene name mentions. A variety of different methods were used and the results varied, with a highest achieved F1 score of 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F score of 0.9066 is feasible, and furthermore that the best result makes use of the lowest scoring submissions.

epublications@Marquette

Fraunhofer-ePrints

PubMed Central

Edinburgh Research Explorer

Publications at Bielefeld University

Apollo (Cambridge)

White Rose Research Online

UvA-DARE

International Migration, Integration and Social Cohesion online publications