85 research outputs found
The Effect of Alignment Objectives on Code-Switching Translation
With the rise of social media and user-generated content, machine translation
models increasingly need to handle code-switched input. In this paper, we
propose a method for training a single machine translation model that
translates monolingual sentences from one language to the other, and that also
translates code-switched sentences into either language. This model can be
considered bilingual in the human sense. To make better use of parallel data,
we generate synthetic code-switched (CSW) data and apply an alignment loss on
the encoder to align representations across languages. Using the WMT14
English-French (En-Fr) dataset, the trained model strongly outperforms
bidirectional baselines on code-switched translation while maintaining quality
for non-code-switched (monolingual) data.
Comment: This paper was originally submitted on 30/06/202
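A minimal PyTorch sketch of what such an encoder alignment loss might look like; the mean pooling and MSE distance here are assumptions for illustration, not necessarily the paper's exact formulation:

```python
import torch.nn.functional as F

def alignment_loss(src_states, tgt_states, src_mask, tgt_mask):
    """Pull mean-pooled encoder states of parallel source/target
    sentences together so the encoder becomes language-neutral.

    src_states, tgt_states: (batch, seq_len, dim) encoder outputs
    src_mask, tgt_mask:     (batch, seq_len), 1 for tokens, 0 for padding
    """
    def mean_pool(states, mask):
        mask = mask.unsqueeze(-1).float()  # (batch, seq_len, 1)
        return (states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    return F.mse_loss(mean_pool(src_states, src_mask),
                      mean_pool(tgt_states, tgt_mask))

# Hypothetical combined objective:
# loss = translation_cross_entropy + lambda_align * alignment_loss(...)
```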
Lightweight Cross-Lingual Sentence Representation Learning
Large-scale models for learning fixed-dimensional cross-lingual sentence
representations like LASER (Artetxe and Schwenk, 2019b) lead to significant
improvement in performance on downstream tasks. However, further increases and
modifications based on such large-scale models are usually impractical due to
memory limitations. In this work, we introduce a lightweight dual-transformer
architecture with just 2 layers for generating memory-efficient cross-lingual
sentence representations. We explore different training tasks and observe that
current cross-lingual training tasks leave a lot to be desired for this shallow
architecture. To ameliorate this, we propose a novel cross-lingual language
model, which combines the existing single-word masked language model with the
newly proposed cross-lingual token-level reconstruction task. We further
augment the training task by the introduction of two computationally-lite
sentence-level contrastive learning tasks to enhance the alignment of
cross-lingual sentence representation space, which compensates for the learning
bottleneck of the lightweight transformer for generative tasks. Our comparisons
with competing models on cross-lingual sentence retrieval and multilingual
document classification confirm the effectiveness of the newly proposed
training tasks for a shallow model.
Comment: ACL 202
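As an illustration of a sentence-level contrastive objective of this kind, here is a minimal PyTorch sketch using in-batch negatives (an InfoNCE-style loss; the temperature value and symmetric formulation are assumptions, not the paper's exact design):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src_emb, tgt_emb, temperature=0.05):
    """InfoNCE-style objective over in-batch negatives: the i-th source
    sentence should be most similar to its own translation.

    src_emb, tgt_emb: (batch, dim) embeddings of parallel sentence pairs
    """
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric: src -> tgt retrieval and tgt -> src retrieval.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
```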
Emu: Enhancing Multilingual Sentence Embeddings with Semantic Specialization
We present Emu, a system that semantically enhances multilingual sentence
embeddings. Our framework fine-tunes pre-trained multilingual sentence
embeddings using two main components: a semantic classifier and a language
discriminator. The semantic classifier improves the semantic similarity of
related sentences, whereas the language discriminator enhances the
multilinguality of the embeddings via multilingual adversarial training. Our
experimental results based on several language pairs show that our specialized
embeddings outperform the state-of-the-art multilingual sentence embedding
model on the task of cross-lingual intent classification using only monolingual
labeled data.
Comment: AAAI 202
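A hedged PyTorch sketch of the adversarial component: a language discriminator learns to identify the language of an embedding while the encoder is updated to fool it. The layer sizes, loss form, and alternating update scheme here are assumptions, not Emu's exact training recipe:

```python
import torch.nn as nn
import torch.nn.functional as F

class LanguageDiscriminator(nn.Module):
    """Predicts which language an embedding came from; the encoder is
    trained adversarially so this prediction becomes impossible."""
    def __init__(self, dim, n_langs, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_langs))

    def forward(self, emb):
        return self.net(emb)

def adversarial_step(encoder, disc, sents, lang_ids, d_opt, e_opt):
    # (1) The discriminator learns to identify the language.
    d_loss = F.cross_entropy(disc(encoder(sents).detach()), lang_ids)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # (2) The encoder learns to erase language identity from embeddings.
    e_loss = -F.cross_entropy(disc(encoder(sents)), lang_ids)
    e_opt.zero_grad()
    e_loss.backward()
    e_opt.step()
```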
ABSent: Cross-Lingual Sentence Representation Mapping with Bidirectional GANs
A number of cross-lingual transfer learning approaches based on neural
networks have been proposed for the case when large amounts of parallel text
are at our disposal. However, in many real-world settings, the size of parallel
annotated training data is restricted. Additionally, prior cross-lingual
mapping research has mainly focused on the word level. This raises the question
of whether such techniques can also be applied to effortlessly obtain
cross-lingually aligned sentence representations. To this end, we propose an
Adversarial Bi-directional Sentence Embedding Mapping (ABSent) framework, which
learns mappings of cross-lingual sentence representations from limited
quantities of parallel data.
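A sketch of a bidirectional adversarial mapping in this spirit: two mapping networks, one per direction, are trained so mapped embeddings are indistinguishable from native embeddings of the other language's space, with a cycle term keeping the directions consistent. All names, architectures, and the cycle loss are illustrative assumptions, not ABSent's exact objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mapper(nn.Module):
    """Linear map between the two embedding spaces (one per direction)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.linear(x)

class Critic(nn.Module):
    """Scores whether an embedding looks native to a given language space."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)

def mapper_loss(src, tgt, f, g, d_tgt, d_src):
    """f: src->tgt, g: tgt->src. Mapped embeddings should fool the critic
    of the other space; a cycle term keeps f and g mutually consistent."""
    pred_t, pred_s = d_tgt(f(src)), d_src(g(tgt))
    adv = (F.binary_cross_entropy_with_logits(pred_t, torch.ones_like(pred_t))
           + F.binary_cross_entropy_with_logits(pred_s, torch.ones_like(pred_s)))
    cycle = F.mse_loss(g(f(src)), src) + F.mse_loss(f(g(tgt)), tgt)
    return adv + cycle
```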
LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation
Large-scale language-agnostic sentence embedding models such as LaBSE (Feng
et al., 2022) obtain state-of-the-art performance for parallel sentence
alignment. However, these large-scale models can suffer from slow inference
and high computational overhead. This study systematically explores learning
language-agnostic sentence embeddings with lightweight models. We demonstrate
that a thin-deep encoder can construct robust low-dimensional sentence
embeddings for 109 languages. With our proposed distillation methods, we
achieve further improvements by incorporating knowledge from a teacher model.
Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness
of our lightweight models. We release our lightweight language-agnostic
sentence embedding models LEALLA on TensorFlow Hub.
Comment: EACL 2023 main conference; LEALLA models: https://tfhub.dev/google/collections/LEALL
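A minimal sketch of embedding-level distillation in this setting: a lightweight student is trained to match a frozen teacher such as LaBSE. The projection layer and MSE objective are assumptions for illustration, not necessarily LEALLA's exact distillation losses:

```python
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, proj):
    """Match the student's low-dimensional embeddings to a frozen
    teacher's. `proj` is an assumed linear layer lifting student
    embeddings to the teacher's dimension.

    student_emb: (batch, d_student); teacher_emb: (batch, d_teacher)
    """
    return F.mse_loss(proj(student_emb), teacher_emb.detach())

# e.g., proj = nn.Linear(d_student, d_teacher); a full objective would
# add the usual parallel-sentence alignment loss on top of this term.
```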