Using Multilingual Word Embeddings for Similarity-Based Word Alignments in a Zero-Shot Setting: Tested on the Case of German–Romansh

Abstract

Using multilingual word embeddings for computing word alignments has been shown to be competitive with statistical word alignment methods. However, the languages on which the experiments were conducted were all “seen” languages, i.e., they were part of the training data for the embeddings. In this thesis I show that multilingual word embeddings taken from mBERT can be used for computing word alignments for the “unseen” language Romansh, aligned against German. The performance is on par with a baseline statistical model (fast_align). I also describe the creation of a gold standard for evaluating the quality of word alignments for German–Romansh, as well as the process of data collection for compiling a trilingual corpus containing press releases in German, Italian and Romansh, published by the Swiss Canton of Grisons. From this corpus, I extracted around 80,000 unique sentence pairs for each language combination.
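
To illustrate the similarity-based approach outlined above, here is a minimal sketch that extracts contextual embeddings from mBERT and aligns words by mutual argmax over cosine similarities. The choice of hidden layer, the mutual-argmax rule, and the example sentence pair are illustrative assumptions and not necessarily the exact configuration used in the thesis; only the Hugging Face `transformers` and `torch` packages are assumed.

```python
# Sketch: similarity-based word alignment with mBERT embeddings (zero-shot for Romansh).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def word_embeddings(words, layer=8):
    """Encode a pre-tokenized sentence and mean-pool subword vectors per word.
    Layer 8 is an assumed choice; other hidden layers can be used."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[layer][0]
    word_ids = enc.word_ids(0)  # maps each subword position to its word index
    vecs = []
    for i in range(len(words)):
        idx = [j for j, w in enumerate(word_ids) if w == i]
        vecs.append(hidden[idx].mean(dim=0))
    return torch.stack(vecs)

def align(src_words, tgt_words):
    """Keep word pairs whose cosine similarities are mutual argmaxes."""
    s = torch.nn.functional.normalize(word_embeddings(src_words), dim=-1)
    t = torch.nn.functional.normalize(word_embeddings(tgt_words), dim=-1)
    sim = s @ t.T
    forward = sim.argmax(dim=1)   # best target word for each source word
    backward = sim.argmax(dim=0)  # best source word for each target word
    return [(i, int(forward[i])) for i in range(len(src_words))
            if int(backward[forward[i]]) == i]

# Hypothetical German–Romansh example pair, purely for illustration.
print(align("Der Kanton informiert die Bevölkerung .".split(),
            "Il chantun infurmescha la populaziun .".split()))
```

The mutual-argmax (intersection) rule favors precision over recall; greedier decoding over the similarity matrix would trade precision for coverage.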
