
    Unveiling Biases in Word Embeddings: An Algorithmic Approach for Comparative Analysis Based on Alignment

    Word embeddings are state-of-the-art vector representations of words designed to preserve semantic similarity. They are the result of specific learning algorithms trained on usually large corpora and consequently inherit all the biases of the corpora on which they were trained. The goal of this thesis is to devise and adapt an efficient algorithm for comparing two word embeddings in order to highlight the biases they are subject to. Specifically, we look for an alignment between the two vector spaces, corresponding to the two word embeddings, that minimises the difference between the stable words, i.e. those that have not changed across the two embeddings, thus highlighting the differences between those that did change. In this work, we test this idea by adapting a machine translation framework called MUSE which, after some improvements, can run over multiple cores in an HPC environment managed with SLURM. We also provide an amplpy implementation of linear and convex programming algorithms adapted to our case. We then test these techniques on a corpus of text taken from Italian newspapers in order to identify which words are most subject to change among the different pairs of corpora.
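    The alignment idea can be made concrete with a minimal sketch of the orthogonal Procrustes step that MUSE's refinement is based on; the function names, the choice of anchor words, and the use of plain NumPy here are illustrative assumptions, not the thesis code:

        import numpy as np

        def procrustes_align(X, Y):
            # X, Y: (n_anchor, d) matrices holding the same "stable" anchor words
            # in the two embeddings. Returns the orthogonal map W minimising
            # ||X W - Y||_F (closed form via the SVD of X^T Y).
            U, _, Vt = np.linalg.svd(X.T @ Y)
            return U @ Vt

        def change_scores(X_all, Y_all, W):
            # After alignment, words that still land far from their counterpart
            # are the ones most subject to change between the two corpora.
            return np.linalg.norm(X_all @ W - Y_all, axis=1)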
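    The linear-programming variant mentioned for amplpy could be set up along the following lines; the model below, which minimises the sum of absolute alignment residuals over the stable words, and the choice of solver are hedged assumptions rather than the formulation used in the thesis:

        from amplpy import AMPL
        import numpy as np

        def lp_align(X, Y):
            # LP sketch: find a linear map W minimising the L1 residual of X W - Y
            # over the stable (anchor) words. Illustrative only.
            n, d = X.shape
            ampl = AMPL()
            ampl.setOption("solver", "highs")  # assumption: any LP solver available to AMPL
            ampl.eval("""
                set WORDS; set DIMS;
                param X {WORDS, DIMS};
                param Y {WORDS, DIMS};
                var W {DIMS, DIMS};
                var dev {WORDS, DIMS} >= 0;   # absolute residuals
                minimize TotalDev: sum {w in WORDS, j in DIMS} dev[w, j];
                s.t. Upper {w in WORDS, j in DIMS}:
                    sum {k in DIMS} X[w, k] * W[k, j] - Y[w, j] <= dev[w, j];
                s.t. Lower {w in WORDS, j in DIMS}:
                    Y[w, j] - sum {k in DIMS} X[w, k] * W[k, j] <= dev[w, j];
            """)
            ampl.getSet("WORDS").setValues(list(range(n)))
            ampl.getSet("DIMS").setValues(list(range(d)))
            ampl.getParameter("X").setValues({(w, j): float(X[w, j]) for w in range(n) for j in range(d)})
            ampl.getParameter("Y").setValues({(w, j): float(Y[w, j]) for w in range(n) for j in range(d)})
            ampl.solve()
            # Collect the alignment matrix from the solved model.
            Wmat = np.zeros((d, d))
            for (k, j), val in ampl.getVariable("W").getValues().toDict().items():
                Wmat[int(k), int(j)] = val
            return Wmat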