Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource
Word embeddings have recently attracted growing interest as a result
of strong performance gains on a variety of tasks. However, most of this
research also underlined the importance of benchmark datasets, and the
difficulty of constructing these for a variety of language-specific tasks.
Still, many of the datasets used in these tasks could prove to be fruitful
linguistic resources, allowing for unique observations into language use and
variability. In this paper we demonstrate the performance of multiple types of
embeddings, created with both count and prediction-based architectures on a
variety of corpora, in two language-specific tasks: relation evaluation, and
dialect identification. For the latter, we compare unsupervised methods with a
traditional, hand-crafted dictionary. With this research, we provide the
embeddings themselves and the relation evaluation task benchmark for use in
further research, and demonstrate how the benchmarked embeddings prove a useful
unsupervised linguistic resource, effectively used in a downstream task. Comment: in LREC 201
Multilingual Models for Compositional Distributed Semantics
We present a novel technique for learning semantic representations, which
extends the distributional hypothesis to multilingual data and joint-space
embeddings. Our models leverage parallel data and learn to strongly align the
embeddings of semantically equivalent sentences, while maintaining sufficient
distance between those of dissimilar sentences. The models do not rely on word
alignments or any syntactic information and are successfully applied to a
number of diverse languages. We extend our approach to learn semantic
representations at the document level, too. We evaluate these models on two
cross-lingual document classification tasks, outperforming the prior state of
the art. Through qualitative analysis and the study of pivoting effects we
demonstrate that our representations are semantically plausible and can capture
semantic relationships across languages without parallel data. Comment: Proceedings of ACL 2014 (Long papers
TermEval 2020: shared task on automatic term extraction using the Annotated Corpora for Term Extraction Research (ACTER) dataset
The TermEval 2020 shared task provided a platform for researchers to work on automatic term extraction (ATE) with the same dataset: the Annotated Corpora for Term Extraction Research (ACTER). The dataset covers three languages (English, French, and Dutch) and four domains, of which the domain of heart failure was kept as a held-out test set on which final F1-scores were calculated. The aim was to provide a large, transparent, qualitatively annotated, and diverse dataset to the ATE research community, with the goal of promoting comparative research and thus identifying strengths and weaknesses of various state-of-the-art methodologies. The results show considerable variation between systems and illustrate how some methodologies reach higher precision or recall, how different systems extract different types of terms, and how some are exceptionally good at finding rare terms or are less affected by term length. The current contribution offers an overview of the shared task with a comparative evaluation, which complements the individual papers by all participants.
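The precision, recall, and F1 figures mentioned in this abstract can be illustrated with a minimal sketch. It assumes exact-match scoring between a set of extracted terms and a set of gold-standard terms; the shared task's actual scoring rules are not given in the abstract, and the function name and toy terms below are hypothetical.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted terms against gold-standard terms (exact match assumed)."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data, not taken from ACTER.
gold_terms = {"heart failure", "ejection fraction", "diuretic", "cardiomyopathy"}
system_terms = {"heart failure", "diuretic", "patient"}
p, r, f1 = precision_recall_f1(system_terms, gold_terms)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A system that extracts few but correct terms scores high on precision and low on recall, which is one way the variation between systems described above can show up.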
Inducing Language-Agnostic Multilingual Representations
Cross-lingual representations have the potential to make NLP techniques
available to the vast majority of languages in the world. However, they
currently require large pretraining corpora or access to typologically similar
languages. In this work, we address these obstacles by removing language
identity signals from multilingual embeddings. We examine three approaches for
this: (i) re-aligning the vector spaces of target languages (all together) to a
pivot source language; (ii) removing language-specific means and variances,
which yields better discriminativeness of embeddings as a by-product; and (iii)
increasing input similarity across languages by removing morphological
contractions and sentence reordering. We evaluate on XNLI and reference-free MT
across 19 typologically diverse languages. Our findings expose the limitations
of these approaches -- unlike vector normalization, vector space re-alignment
and text normalization do not achieve consistent gains across encoders and
languages. However, due to the approaches' additive effects, their combination
decreases the cross-lingual transfer gap by 8.9 points (m-BERT) and 18.2 points
(XLM-R) on average across all tasks and languages. Our code and models are
publicly available. Comment: *SEM2021 Camera Read
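Approach (ii) in this abstract, removing language-specific means and variances, can be illustrated with a minimal sketch. It assumes per-dimension standardization over all vectors of one language; the abstract does not say which layer, granularity, or statistics the authors use, so the function name, the `eps` parameter, and the toy vectors are hypothetical.

```python
from statistics import mean, pstdev


def normalize_per_language(embeddings_by_lang, eps=1e-8):
    """Standardize each embedding dimension separately within each language."""
    normalized = {}
    for lang, vectors in embeddings_by_lang.items():
        columns = list(zip(*vectors))            # one tuple per dimension
        means = [mean(col) for col in columns]
        stds = [pstdev(col) for col in columns]  # population standard deviation
        normalized[lang] = [
            # eps keeps the division safe when a dimension is constant.
            [(x - m) / (s + eps) for x, m, s in zip(vec, means, stds)]
            for vec in vectors
        ]
    return normalized


# Hypothetical 2-dimensional toy embeddings for two languages.
toy = {
    "en": [[1.0, 2.0], [3.0, 6.0]],
    "nl": [[10.0, 0.0], [14.0, 4.0]],
}
result = normalize_per_language(toy)
print(result["en"])
print(result["nl"])
```

After this step, both languages have zero mean and roughly unit variance in every dimension, so a constant per-language offset no longer identifies the language.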