Encoders for latent models with multiple non-homomorphic realizations

Natural Language Processing (NLP) is a widely used field of AI, covering tasks that range from translation and speech recognition to handwriting recognition and part-of-speech (POS) tagging. Word embedding is the practice of representing words mathematically, typically as vectors, so that NLP tasks can be performed on them. The idea behind this thesis is related: to create probabilistic representations of a word's part of speech. For this, a special Conditioned Variational AutoEncoder (CVAE) is used. The CVAE is trained to copy its input by building a latent space from which the model samples to generate its output, in this case a word and its context. The variable it is conditioned on is the POS tag.

POS tags can be obtained easily with libraries like NLTK, but the word given to the model must itself have some representation. A popular approach is word2vec, which assigns a vector representation to each word given a large text corpus. However, word2vec does not encode context in its vectors. Another option is Google's Transformer BERT, a bidirectional language model trained to predict a masked word in a sequence and to determine whether two sequences are continuations of each other. BERT can also be used to create word embeddings by performing feature extraction on its hidden layers for any given word in a sequence; because of its bidirectional nature, these embeddings contain ordered contextual information from both the left and the right.

Although not consistently, the CVAE is able to represent three different POS of a word with three different Gaussians. A set of weights that selects one Gaussian when sampling from the latent space proves essential to accomplishing this task. This work could be extended so that the representations capture word meanings, or even word meanings across multiple languages.
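As a concrete illustration of the tagging step, here is a minimal sketch of obtaining POS tags with NLTK, as the abstract suggests (the specific tagger and example sentence are assumptions, not taken from the thesis):

```python
# Minimal POS-tagging sketch with NLTK. The tagger choice is an assumption;
# the abstract only says tags "can be obtained easily with libraries like NLTK".
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer models
nltk.download("averaged_perceptron_tagger", quiet=True)  # default POS tagger

sentence = "The bank can guarantee the deposit."
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)  # list of (word, Penn Treebank tag) pairs
print(tags)  # e.g. [('The', 'DT'), ('bank', 'NN'), ('can', 'MD'), ...]
```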
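The static nature of word2vec vectors, which is the limitation the abstract points out, can be seen with a short lookup using the gensim library (an assumed tooling choice; the thesis names only word2vec itself):

```python
# word2vec lookup sketch using gensim (an assumed library choice).
# Each word maps to one fixed vector, regardless of surrounding context.
import gensim.downloader

w2v = gensim.downloader.load("word2vec-google-news-300")
print(w2v["bank"].shape)  # (300,); the same vector for every sense of "bank"
```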
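Feature extraction from BERT's hidden layers, as described above, might look like the following sketch using the Hugging Face transformers library (an assumption: the thesis specifies BERT but not an implementation, and the layer-pooling choice here is also illustrative):

```python
# Contextual word embeddings via feature extraction from BERT's hidden layers.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentence = "The bank can guarantee the deposit."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (embedding layer + 12 encoder layers),
# each of shape (1, seq_len, 768) for bert-base.
hidden_states = outputs.hidden_states
# One common feature-extraction choice (an assumption here): sum the last
# four layers to get each token's contextual embedding.
token_embeddings = torch.stack(hidden_states[-4:]).sum(dim=0).squeeze(0)
print(token_embeddings.shape)  # (seq_len, 768)
```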
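Finally, the mixture-of-Gaussians latent space with selection weights can be sketched as below. All architectural details (layer sizes, K = 3 components, the tag-vector dimension, Gumbel-softmax selection) are illustrative assumptions; the abstract states only that per-POS Gaussians exist and that weights select one Gaussian when sampling:

```python
# Sketch of a Conditioned VAE whose latent space is a mixture of K Gaussians,
# with learned weights that select one Gaussian at sampling time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureCVAE(nn.Module):
    def __init__(self, x_dim=768, c_dim=45, z_dim=16, k=3, h_dim=256):
        # x_dim: word-embedding size; c_dim: one-hot POS condition size
        # (45 approximates the Penn Treebank tag set; an assumption).
        super().__init__()
        self.k, self.z_dim = k, z_dim
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, k * z_dim)      # one mean per Gaussian
        self.logvar = nn.Linear(h_dim, k * z_dim)  # one log-variance per Gaussian
        self.weights = nn.Linear(h_dim, k)         # selection weights over Gaussians
        self.dec = nn.Sequential(
            nn.Linear(z_dim + c_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu = self.mu(h).view(-1, self.k, self.z_dim)
        logvar = self.logvar(h).view(-1, self.k, self.z_dim)
        # Approximately one-hot, differentiable selection of a single Gaussian.
        sel = F.gumbel_softmax(self.weights(h), tau=1.0, hard=True)  # (B, k)
        mu_sel = (sel.unsqueeze(-1) * mu).sum(dim=1)
        logvar_sel = (sel.unsqueeze(-1) * logvar).sum(dim=1)
        # Reparameterized sample from the selected Gaussian.
        z = mu_sel + torch.randn_like(mu_sel) * torch.exp(0.5 * logvar_sel)
        return self.dec(torch.cat([z, c], dim=-1)), mu_sel, logvar_sel

def loss_fn(x_hat, x, mu, logvar):
    # Standard VAE objective applied to the selected component:
    # reconstruction error plus KL divergence to a unit Gaussian.
    rec = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

The hard Gumbel-softmax keeps the component choice discrete in the forward pass while remaining differentiable, which is one plausible way to realize the "weights that select one Gaussian" described in the abstract.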