Using Multilingual Word Embeddings for Similarity-Based Word Alignments in a Zero-Shot Setting: Tested on the Case of German–Romansh

Abstract

Using multilingual word embeddings for computing word alignments has been shown to be competitive with statistical word alignment methods. However, the languages on which the experiments were conducted were all “seen” languages, i.e., they were part of the training data for the embeddings. In this thesis I show that multilingual word embeddings taken from mBERT can be used for computing word alignments for the “unseen” language Romansh, aligned against German. The performance is on par with a baseline statistical model (fast_align). I also describe the creation of a gold standard for evaluating the quality of word alignments for German–Romansh, as well as the process of data collection for compiling a trilingual corpus containing press releases in German, Italian and Romansh, published by the Swiss Canton of Grisons. From this corpus, I extracted around 80,000 unique sentence pairs for each language combination.
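
To illustrate the similarity-based approach outlined above, here is a minimal sketch that extracts contextual embeddings from mBERT and aligns words by mutual argmax over cosine similarities. The choice of hidden layer, the mutual-argmax rule, and the example sentence pair are illustrative assumptions and not necessarily the exact configuration used in the thesis; only the Hugging Face `transformers` and `torch` packages are assumed.

```python
# Sketch: similarity-based word alignment with mBERT embeddings (zero-shot for Romansh).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def word_embeddings(words, layer=8):
    """Encode a pre-tokenized sentence and mean-pool subword vectors per word.
    Layer 8 is an assumed choice; other hidden layers can be used."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[layer][0]
    word_ids = enc.word_ids(0)  # maps each subword position to its word index
    vecs = []
    for i in range(len(words)):
        idx = [j for j, w in enumerate(word_ids) if w == i]
        vecs.append(hidden[idx].mean(dim=0))
    return torch.stack(vecs)

def align(src_words, tgt_words):
    """Keep word pairs whose cosine similarities are mutual argmaxes."""
    s = torch.nn.functional.normalize(word_embeddings(src_words), dim=-1)
    t = torch.nn.functional.normalize(word_embeddings(tgt_words), dim=-1)
    sim = s @ t.T
    forward = sim.argmax(dim=1)   # best target word for each source word
    backward = sim.argmax(dim=0)  # best source word for each target word
    return [(i, int(forward[i])) for i in range(len(src_words))
            if int(backward[forward[i]]) == i]

# Hypothetical German–Romansh example pair, purely for illustration.
print(align("Der Kanton informiert die Bevölkerung .".split(),
            "Il chantun infurmescha la populaziun .".split()))
```

The mutual-argmax (intersection) rule favors precision over recall; greedier decoding over the similarity matrix would trade precision for coverage.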
