Corpora and evaluation tools for multilingual named entity grammar development

Author: Bering Christian
Droźdźyński Witold
Erbach Gregor
Guasch Clara
Homola Petr
Krieger Hans-Ulrich
Lehmann Sabine
Li Hong
Piskorski Jakub
Schäfer Ulrich
Shimada Atsuko
Siegel Melanie
Xu Feiyu
Ziegler-Eisele Dorothee
Publication date: 14/12/2011

We present work on developing multilingual named entity grammars in a unification-based finite-state formalism (SProUT). Following an extended version of the MUC-7 standard, we have developed named entity recognition grammars for German, Chinese, Japanese, French, Spanish, English, and Czech. The grammars recognize person names, organizations, geographical locations, currency expressions, and time and date expressions. Subgrammars and gazetteers are shared across the language-specific grammars as far as possible. Multilingual corpora from the business domain are used for grammar development and evaluation. We describe the annotation format, which covers named entities and other linguistic information. We also present an evaluation tool that provides detailed statistics and diagnostics, allows for partial matching of annotations, and supports user-defined mappings between different annotation and grammar output formats.
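To make the idea of partial matching with user-defined label mappings concrete, here is a minimal sketch in Python. It is not the evaluation tool described in the abstract: the `Span` type, the `evaluate` function, the `label_map` argument, and the `partial_weight` of 0.5 are all hypothetical choices made for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character position."""
    return a.start < b.end and b.start < a.end


def evaluate(gold, predicted, label_map=None, partial_weight=0.5):
    """Score predicted spans against gold spans.

    label_map (hypothetical) translates grammar output labels into the
    gold annotation scheme. partial_weight is an assumed credit for a
    prediction that overlaps a gold span of the same label without
    matching its boundaries exactly.
    """
    label_map = label_map or {}
    preds = [Span(p.start, p.end, label_map.get(p.label, p.label))
             for p in predicted]

    # Pass 1: exact matches, so a partial match never consumes a gold
    # span that some other prediction matches exactly.
    remaining_gold = list(gold)
    remaining_pred = []
    exact = 0
    for p in preds:
        if p in remaining_gold:
            remaining_gold.remove(p)
            exact += 1
        else:
            remaining_pred.append(p)

    # Pass 2: partial matches (same label, overlapping extent).
    partial = 0
    for p in remaining_pred:
        for g in remaining_gold:
            if g.label == p.label and overlaps(g, p):
                remaining_gold.remove(g)
                partial += 1
                break

    credit = exact + partial_weight * partial
    return {
        "exact": exact,
        "partial": partial,
        "precision": credit / len(preds) if preds else 0.0,
        "recall": credit / len(gold) if gold else 0.0,
    }


gold = [Span(0, 5, "PER"), Span(10, 20, "ORG")]
predicted = [Span(0, 5, "person"), Span(12, 20, "organization")]
result = evaluate(gold, predicted,
                  label_map={"person": "PER", "organization": "ORG"})
```

In this example the first prediction matches exactly after label mapping and the second only overlaps its gold span, so the two count as one exact and one partial match.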

Hochschulschriftenserver - Universität Frankfurt am Main

Weakly Supervised Cross-Lingual Named Entity Recognition via Effective Annotation and Representation Projection

Author: Dinu Georgiana
Florian Radu
Ni Jian
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2017

State-of-the-art named entity recognition (NER) systems are supervised machine learning models that require large amounts of manually annotated data to achieve high accuracy. However, manual annotation of NER data is expensive and time-consuming, and can be quite difficult for a new language. In this paper, we present two weakly supervised approaches for cross-lingual NER with no human annotation in a target language. The first approach is to create automatically labeled NER data for a target language via annotation projection on comparable corpora, where we develop a heuristic scheme that effectively selects good-quality projection-labeled data from noisy data. The second approach is to project distributed representations of words (word embeddings) from a target language to a source language, so that the source-language NER system can be applied to the target language without re-training. We also design two co-decoding schemes that effectively combine the outputs of the two projection-based approaches. We evaluate the performance of the proposed approaches on both in-house and open NER data for several target languages. The results show that the combined systems outperform three other weakly supervised approaches on the CoNLL data. Comment: 11 pages, The 55th Annual Meeting of the Association for Computational Linguistics (ACL), 201
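As a rough illustration of annotation projection, the following Python sketch copies entity labels from source tokens to target tokens through word alignments and computes a simple coverage score that could be used to filter noisy sentence pairs. It is an assumption-laden toy, not the heuristic selection scheme from the paper: the function names, the plain (non-BIO) labels, and the coverage criterion are all hypothetical.

```python
def project_tags(source_tags, alignment, target_len):
    """Copy entity labels from source tokens to aligned target tokens.

    source_tags: one label per source token, "O" for non-entities.
    alignment:   (source_index, target_index) pairs, assumed to come
                 from an external word aligner.
    """
    target_tags = ["O"] * target_len
    for s, t in alignment:
        if source_tags[s] != "O":
            target_tags[t] = source_tags[s]
    return target_tags


def projection_coverage(source_tags, alignment):
    """Fraction of source entity tokens that have at least one alignment.

    A hypothetical quality signal: sentence pairs with low coverage
    could be discarded as unreliable projections.
    """
    entity_positions = {i for i, tag in enumerate(source_tags) if tag != "O"}
    if not entity_positions:
        return 1.0  # nothing to project, so nothing was lost
    aligned_positions = {s for s, _ in alignment}
    return len(entity_positions & aligned_positions) / len(entity_positions)


source_tags = ["PER", "PER", "O", "LOC"]
full_alignment = [(0, 0), (1, 1), (3, 2)]
projected = project_tags(source_tags, full_alignment, target_len=4)
coverage_full = projection_coverage(source_tags, full_alignment)
coverage_sparse = projection_coverage(source_tags, [(0, 0), (3, 2)])
```

A real pipeline would also need to repair entity boundaries after projection and handle one-to-many alignments, which this sketch leaves out.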

arXiv.org e-Print Archive

Crossref