85 research outputs found
The Effect of Alignment Objectives on Code-Switching Translation
With the rise of social media and user-generated content, machine translation
models increasingly need to handle code-switched input. In this paper, we
propose a method for training a single machine translation model that
translates monolingual sentences from one language to the other, and that also
translates code-switched sentences into either language. This model can be
considered bilingual in the human sense. To make better use of parallel data,
we generate synthetic code-switched (CSW) data and apply an alignment loss on
the encoder to align representations across languages. Using the WMT14
English-French (En-Fr) dataset, the trained model strongly outperforms
bidirectional baselines on code-switched translation while maintaining quality
for non-code-switched (monolingual) data.
Comment: This paper was originally submitted on 30/06/202
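A minimal PyTorch sketch of what such an encoder alignment loss might look like; the mean pooling and MSE distance here are assumptions for illustration, not necessarily the paper's exact formulation:

```python
import torch.nn.functional as F

def alignment_loss(src_states, tgt_states, src_mask, tgt_mask):
    """Pull mean-pooled encoder states of parallel source/target
    sentences together so the encoder becomes language-neutral.

    src_states, tgt_states: (batch, seq_len, dim) encoder outputs
    src_mask, tgt_mask:     (batch, seq_len), 1 for tokens, 0 for padding
    """
    def mean_pool(states, mask):
        mask = mask.unsqueeze(-1).float()  # (batch, seq_len, 1)
        return (states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    return F.mse_loss(mean_pool(src_states, src_mask),
                      mean_pool(tgt_states, tgt_mask))

# Hypothetical combined objective:
# loss = translation_cross_entropy + lambda_align * alignment_loss(...)
```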
Lightweight Cross-Lingual Sentence Representation Learning
Large-scale models for learning fixed-dimensional cross-lingual sentence
representations like LASER (Artetxe and Schwenk, 2019b) lead to significant
improvement in performance on downstream tasks. However, further increases and
modifications based on such large-scale models are usually impractical due to
memory limitations. In this work, we introduce a lightweight dual-transformer
architecture with just 2 layers for generating memory-efficient cross-lingual
sentence representations. We explore different training tasks and observe that
current cross-lingual training tasks leave a lot to be desired for this shallow
architecture. To ameliorate this, we propose a novel cross-lingual language
model, which combines the existing single-word masked language model with the
newly proposed cross-lingual token-level reconstruction task. We further
augment the training task by the introduction of two computationally-lite
sentence-level contrastive learning tasks to enhance the alignment of
cross-lingual sentence representation space, which compensates for the learning
bottleneck of the lightweight transformer for generative tasks. Our comparisons
with competing models on cross-lingual sentence retrieval and multilingual
document classification confirm the effectiveness of the newly proposed
training tasks for a shallow model.
Comment: ACL 202
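As an illustration of a sentence-level contrastive objective of this kind, here is a minimal PyTorch sketch using in-batch negatives (an InfoNCE-style loss; the temperature value and symmetric formulation are assumptions, not the paper's exact design):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src_emb, tgt_emb, temperature=0.05):
    """InfoNCE-style objective over in-batch negatives: the i-th source
    sentence should be most similar to its own translation.

    src_emb, tgt_emb: (batch, dim) embeddings of parallel sentence pairs
    """
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric: src -> tgt retrieval and tgt -> src retrieval.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
```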
Emu: Enhancing Multilingual Sentence Embeddings with Semantic Specialization
We present Emu, a system that semantically enhances multilingual sentence
embeddings. Our framework fine-tunes pre-trained multilingual sentence
embeddings using two main components: a semantic classifier and a language
discriminator. The semantic classifier improves the semantic similarity of
related sentences, whereas the language discriminator enhances the
multilinguality of the embeddings via multilingual adversarial training. Our
experimental results based on several language pairs show that our specialized
embeddings outperform the state-of-the-art multilingual sentence embedding
model on the task of cross-lingual intent classification using only monolingual
labeled data.
Comment: AAAI 202
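A hedged PyTorch sketch of the adversarial component: a language discriminator learns to identify the language of an embedding while the encoder is updated to fool it. The layer sizes, loss form, and alternating update scheme here are assumptions, not Emu's exact training recipe:

```python
import torch.nn as nn
import torch.nn.functional as F

class LanguageDiscriminator(nn.Module):
    """Predicts which language an embedding came from; the encoder is
    trained adversarially so this prediction becomes impossible."""
    def __init__(self, dim, n_langs, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_langs))

    def forward(self, emb):
        return self.net(emb)

def adversarial_step(encoder, disc, sents, lang_ids, d_opt, e_opt):
    # (1) The discriminator learns to identify the language.
    d_loss = F.cross_entropy(disc(encoder(sents).detach()), lang_ids)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # (2) The encoder learns to erase language identity from embeddings.
    e_loss = -F.cross_entropy(disc(encoder(sents)), lang_ids)
    e_opt.zero_grad()
    e_loss.backward()
    e_opt.step()
```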
ABSent: Cross-Lingual Sentence Representation Mapping with Bidirectional GANs
A number of cross-lingual transfer learning approaches based on neural
networks have been proposed for the case when large amounts of parallel text
are at our disposal. However, in many real-world settings, the size of parallel
annotated training data is restricted. Additionally, prior cross-lingual
mapping research has mainly focused on the word level. This raises the question
of whether such techniques can also be applied to effortlessly obtain
cross-lingually aligned sentence representations. To this end, we propose an
Adversarial Bi-directional Sentence Embedding Mapping (ABSent) framework, which
learns mappings of cross-lingual sentence representations from limited
quantities of parallel data.
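A sketch of a bidirectional adversarial mapping in this spirit: two mapping networks, one per direction, are trained so mapped embeddings are indistinguishable from native embeddings of the other language's space, with a cycle term keeping the directions consistent. All names, architectures, and the cycle loss are illustrative assumptions, not ABSent's exact objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mapper(nn.Module):
    """Linear map between the two embedding spaces (one per direction)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.linear(x)

class Critic(nn.Module):
    """Scores whether an embedding looks native to a given language space."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)

def mapper_loss(src, tgt, f, g, d_tgt, d_src):
    """f: src->tgt, g: tgt->src. Mapped embeddings should fool the critic
    of the other space; a cycle term keeps f and g mutually consistent."""
    pred_t, pred_s = d_tgt(f(src)), d_src(g(tgt))
    adv = (F.binary_cross_entropy_with_logits(pred_t, torch.ones_like(pred_t))
           + F.binary_cross_entropy_with_logits(pred_s, torch.ones_like(pred_s)))
    cycle = F.mse_loss(g(f(src)), src) + F.mse_loss(f(g(tgt)), tgt)
    return adv + cycle
```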
LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation
Large-scale language-agnostic sentence embedding models such as LaBSE (Feng
et al., 2022) obtain state-of-the-art performance for parallel sentence
alignment. However, these large-scale models can suffer from slow inference
and high computational overhead. This study systematically explores learning
language-agnostic sentence embeddings with lightweight models. We demonstrate
that a thin-deep encoder can construct robust low-dimensional sentence
embeddings for 109 languages. With our proposed distillation methods, we
achieve further improvements by incorporating knowledge from a teacher model.
Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness
of our lightweight models. We release our lightweight language-agnostic
sentence embedding models LEALLA on TensorFlow Hub.
Comment: EACL 2023 main conference; LEALLA models: https://tfhub.dev/google/collections/LEALL
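A minimal sketch of embedding-level distillation in this setting: a lightweight student is trained to match a frozen teacher such as LaBSE. The projection layer and MSE objective are assumptions for illustration, not necessarily LEALLA's exact distillation losses:

```python
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, proj):
    """Match the student's low-dimensional embeddings to a frozen
    teacher's. `proj` is an assumed linear layer lifting student
    embeddings to the teacher's dimension.

    student_emb: (batch, d_student); teacher_emb: (batch, d_teacher)
    """
    return F.mse_loss(proj(student_emb), teacher_emb.detach())

# e.g., proj = nn.Linear(d_student, d_teacher); a full objective would
# add the usual parallel-sentence alignment loss on top of this term.
```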