Challenges and solutions for Latin named entity recognition
Although spanning thousands of years and genres as diverse as liturgy, historiography, lyric and other forms of prose and poetry, the body of Latin texts is still relatively sparse compared to English. Data sparsity in Latin presents a number of challenges for traditional Named Entity
Recognition techniques. Solving such challenges and enabling reliable Named Entity Recognition in Latin texts can facilitate many downstream applications, from machine translation to digital historiography, enabling classicists, historians, and archaeologists, for instance, to track
the relationships of historical persons, places, and groups on a large scale. This paper presents the first annotated corpus for evaluating Named Entity Recognition in Latin, as well as a fully supervised model that achieves over 90% F-score on a held-out test set, significantly outperforming a competitive baseline. We also present a novel active learning strategy that predicts how many and which sentences need to be annotated for named entities in order to attain a specified degree
of accuracy when recognizing named entities automatically in a given text. This maximizes the productivity of annotators while simultaneously controlling quality.
Empower Sequence Labeling with Task-Aware Neural Language Model
Linguistic sequence labeling is a general modeling approach that encompasses
a variety of problems, such as part-of-speech tagging and named entity
recognition. Recent advances in neural networks (NNs) make it possible to build
reliable models without handcrafted features. However, in many cases, it is
hard to obtain sufficient annotations to train these models. In this study, we
develop a novel neural framework to extract abundant knowledge hidden in raw
texts to empower the sequence labeling task. Besides word-level knowledge
contained in pre-trained word embeddings, character-aware neural language
models are incorporated to extract character-level knowledge. Transfer learning
techniques are further adopted to mediate different components and guide the
language model towards the key knowledge. Compared to previous methods, this
task-specific knowledge allows us to adopt a more concise model and conduct
more efficient training. Unlike most transfer learning methods, the
proposed framework does not rely on any additional supervision. It extracts
knowledge from self-contained order information of training sequences.
Extensive experiments on benchmark datasets demonstrate the effectiveness of
leveraging character-level knowledge and the efficiency of co-training. For
example, on the CoNLL03 NER task, model training completes in about 6 hours on
a single GPU, reaching F1 score of 91.71±0.10 without using any extra
annotation. Comment: AAAI 201
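The F1 score quoted for CoNLL03 NER is conventionally computed over whole entity spans rather than individual tokens. The abstract does not describe its scorer, so the following is only a minimal sketch of exact-match span F1 under that assumption; the function names `bio_spans` and `span_f1` are hypothetical and do not come from the paper.

```python
def bio_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        # Close the open span unless this tag continues it.
        if start is not None and tag != "I-" + etype:
            spans.add((start, i, etype))
            start, etype = None, None
        # Open a new span on B-; a stray I- is treated as a span start too.
        if start is None and tag != "O":
            start, etype = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Exact-match F1: a predicted span counts only if boundaries and type agree."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
score = span_f1(gold, pred)  # precision 1.0, recall 0.5
```

Here the model finds one of the two gold entities and predicts nothing spurious, so precision is 1.0 and recall is 0.5.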
Verbal chunk extraction in French using limited resources
A way of extracting French verbal chunks, inflected and infinitive, is
explored and tested on a real corpus. Declarative morphological and local
grammar rules specifying chunks and some simple contextual structures are used,
relying on limited lexical information and some simple heuristic/statistical
properties obtained from restricted corpora. The specific goals, the
architecture and the formalism of the system, the linguistic information on
which it relies, and the results obtained on a real corpus are presented.
Cross-lingual Argumentation Mining: Machine Translation (and a bit of Projection) is All You Need!
Argumentation mining (AM) requires the identification of complex discourse
structures and has lately been applied with success monolingually. In this
work, we show that the existing resources are, however, not adequate for
assessing cross-lingual AM, due to their heterogeneity or lack of complexity.
We therefore create suitable parallel corpora by (human and machine)
translating a popular AM dataset consisting of persuasive student essays into
German, French, Spanish, and Chinese. We then compare (i) annotation projection
and (ii) bilingual word embedding-based direct transfer strategies for
cross-lingual AM, finding that the former performs considerably better and
almost eliminates the loss from cross-lingual transfer. Moreover, we find that
annotation projection works equally well when using either costly human or
cheap machine translations. Our code and data are available at
http://github.com/UKPLab/coling2018-xling_argument_mining. Comment: Accepted at Coling 201
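Annotation projection, as compared in the abstract above, carries labels from an annotated source sentence to its translation through word alignments. The abstract does not give the procedure, so this is only a simplified sketch under assumptions: the function `project_spans`, the span format, and the rule of filling the range between the leftmost and rightmost aligned tokens are all illustrative choices, not the authors' released code.

```python
def project_spans(src_spans, alignment, tgt_len):
    """Project labeled source spans onto a target sentence.

    src_spans: list of (start, end, label) tuples, end exclusive.
    alignment: list of (src_index, tgt_index) word-alignment pairs.
    Returns one label per target token, "O" where nothing was projected.
    """
    tgt_labels = ["O"] * tgt_len
    for start, end, label in src_spans:
        # Collect every target token aligned to a token inside the source span.
        hits = [t for s, t in alignment if start <= s < end]
        if not hits:
            continue  # unaligned span: nothing to project
        # Fill the range between the leftmost and rightmost aligned tokens
        # so the projected component stays contiguous.
        for t in range(min(hits), max(hits) + 1):
            tgt_labels[t] = label
    return tgt_labels


# Hypothetical example: two argument components in a 5-token source sentence,
# projected onto a 6-token translation with some reordering.
src_spans = [(0, 2, "Claim"), (3, 5, "Premise")]
alignment = [(0, 1), (1, 0), (2, 2), (3, 3), (4, 5)]
projected = project_spans(src_spans, alignment, tgt_len=6)
```

Filling the gap keeps a component contiguous even when an inserted target word (index 4 here) has no alignment. In practice the alignments would come from a word aligner run over the parallel corpus, which is outside the scope of this sketch.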