Joint Representation Learning of Cross-lingual Words and Entities via Attentive Distant Supervision
Joint representation learning of words and entities benefits many NLP tasks,
but has not been well explored in cross-lingual settings. In this paper, we
propose a novel method for joint representation learning of cross-lingual words
and entities. It captures mutually complementary knowledge, and enables
cross-lingual inferences among knowledge bases and texts. Our method does not
require parallel corpora, and automatically generates comparable data via
distant supervision using multi-lingual knowledge bases. We utilize two types
of regularizers to align cross-lingual words and entities, and design knowledge
attention and cross-lingual attention to further reduce noises. We conducted a
series of experiments on three tasks: word translation, entity relatedness, and
cross-lingual entity linking. The results, both qualitatively and
quantitatively, demonstrate the significance of our method. Comment: 11 pages, EMNLP201
Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark
Modern Entity Linking (EL) systems entrench a popularity bias, yet there is
no dataset focusing on tail and emerging entities in languages other than
English. We present Hansel, a new benchmark in Chinese that fills the vacancy
of non-English few-shot and zero-shot EL challenges. The test set of Hansel is
human annotated and reviewed, created with a novel method for collecting
zero-shot EL datasets. It covers 10K diverse documents in news, social media
posts and other web articles, with Wikidata as its target Knowledge Base. We
demonstrate that the existing state-of-the-art EL system performs poorly on
Hansel (R@1 of 36.6% on Few-Shot). We then establish a strong baseline that
scores a R@1 of 46.2% on Few-Shot and 76.6% on Zero-Shot on our dataset. We
also show that our baseline achieves competitive results on TAC-KBP2015 Chinese
Entity Linking task. Comment: WSDM 202
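The R@1 (recall-at-1) figures above measure how often a system's top-ranked candidate entity matches the gold annotation. A minimal sketch of that metric is shown below; the function name and the dictionary-based input format are illustrative assumptions, not Hansel's actual evaluation code.

```python
def recall_at_1(predictions, gold):
    """Fraction of mentions whose top-ranked candidate equals the gold entity.

    predictions: dict mapping mention id -> ranked list of candidate entity ids
    gold: dict mapping mention id -> gold entity id
    Mentions with no prediction count as misses.
    """
    hits = sum(
        1
        for mention, gold_entity in gold.items()
        if predictions.get(mention, [None])[0] == gold_entity
    )
    return hits / len(gold)


# Example: two mentions, one resolved correctly at rank 1.
preds = {"m1": ["Q42", "Q7"], "m2": ["Q99"]}
labels = {"m1": "Q42", "m2": "Q5"}
print(recall_at_1(preds, labels))  # 0.5
```

Under this formulation, a Few-Shot R@1 of 36.6% means the top candidate is correct for roughly one in three tail-entity mentions.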
WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition
Named Entity Recognition (NER) is a fundamental NLP task with a wide range
of practical applications. The performance of state-of-the-art NER methods
depends on high-quality, manually annotated datasets, which still do not exist
for some languages. In this work, we aim to remedy this situation in Slovak by
introducing WikiGoldSK, the first sizable human-labelled Slovak NER dataset. We
benchmark it by evaluating state-of-the-art multilingual Pretrained Language
Models and comparing it to the existing silver-standard Slovak NER dataset. We
also conduct few-shot experiments and show that training on a silver-standard
dataset yields better results. To enable future work that can be based on
Slovak NER, we release the dataset, code, as well as the trained models
publicly under permissible licensing terms at
https://github.com/NaiveNeuron/WikiGoldSK. Comment: BSNLP 2023 Workshop at EACL 202