Evaluating the Underlying Gender Bias in Contextualized Word Embeddings
Gender bias strongly affects natural language processing applications.
Word embeddings have been shown both to retain and to amplify gender biases
present in current data sources. Recently, contextualized word
embeddings have enhanced previous word embedding techniques by computing word
vector representations that depend on the sentence in which the word appears.
In this paper, we study the impact of this conceptual change in word
embedding computation on gender bias. Our analysis includes
different measures previously applied in the literature to standard word
embeddings. Our findings suggest that contextualized word embeddings are less
biased than standard ones, even when the latter are debiased.
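As a concrete illustration, one measure in this family is a direct-bias style score: the mean absolute cosine similarity of target words to a gender direction estimated from he/she style word pairs. The sketch below applies it to static embeddings with gensim; the file path, word pairs, and target words are placeholders, and this is not the paper's exact protocol.

```python
# Minimal sketch (not the paper's exact protocol): a direct-bias style measure
# on standard (static) word embeddings, assuming gensim KeyedVectors.
import numpy as np
from gensim.models import KeyedVectors

def gender_direction(kv):
    """Average of several he/she style difference vectors (illustrative pairs)."""
    pairs = [("he", "she"), ("man", "woman"), ("father", "mother")]
    diffs = [kv[a] - kv[b] for a, b in pairs if a in kv and b in kv]
    d = np.mean(diffs, axis=0)
    return d / np.linalg.norm(d)

def direct_bias(kv, target_words):
    """Mean absolute cosine similarity of target words to the gender direction."""
    d = gender_direction(kv)
    sims = []
    for w in target_words:
        if w in kv:
            v = kv[w] / np.linalg.norm(kv[w])
            sims.append(abs(float(np.dot(v, d))))
    return sum(sims) / len(sims)

if __name__ == "__main__":
    # Path and target words are placeholders, not the datasets used in the paper.
    kv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
    occupations = ["nurse", "engineer", "teacher", "programmer"]
    print("direct bias:", direct_bias(kv, occupations))
```

A value near zero indicates that, under this particular measure, the target words carry little projection onto the estimated gender direction; comparing such scores across embedding types is the kind of analysis the abstract describes.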
Using Word Embeddings in Twitter Election Classification
Word embeddings and convolutional neural networks (CNN)
have attracted extensive attention in various classification
tasks for Twitter, e.g. sentiment classification. However,
the effect of the configuration used to train and generate
the word embeddings on the classification performance has
not been studied in the existing literature. In this paper,
using a Twitter election classification task that aims to detect
election-related tweets, we investigate the impact of
the background dataset used to train the embedding models,
the context window size and the dimensionality of word
embeddings on the classification performance. By comparing
the classification results of two word embedding models,
which are trained using different background corpora
(e.g. Wikipedia articles and Twitter microposts), we show
that the background data type should align with the Twitter
classification dataset to achieve better performance. Moreover,
by evaluating the results of word embedding models
trained using various context window sizes and dimensionalities,
we find that larger context window sizes and dimensionalities are
preferable for improving performance. Our experimental
results also show that using word embeddings and
CNN leads to statistically significant improvements over various
baselines such as random, SVM with TF-IDF, and SVM
with word embeddings.
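The configuration sweep described above can be approximated with a short script. The sketch below trains gensim Word2Vec models over a grid of context window sizes and dimensionalities on two background corpora; the corpus file names and grid values are placeholders rather than the paper's actual setup.

```python
# Minimal sketch of an embedding-configuration sweep with gensim Word2Vec.
# Corpus files are assumed to be tokenized, one sentence per line (placeholders).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

def train_embeddings(corpus_path, window, dim):
    """Train a skip-gram model with the given context window and dimensionality."""
    sentences = LineSentence(corpus_path)
    model = Word2Vec(sentences, vector_size=dim, window=window, sg=1,
                     min_count=5, workers=4, epochs=5)
    return model.wv

if __name__ == "__main__":
    for corpus in ["wikipedia_tokenized.txt", "tweets_tokenized.txt"]:  # background corpora
        for window in [5, 10, 20]:
            for dim in [100, 300, 500]:
                wv = train_embeddings(corpus, window, dim)
                wv.save(f"wv_{corpus.split('_')[0]}_w{window}_d{dim}.kv")
```

Each saved vector set would then be fed to the downstream classifier (e.g. a CNN over embedded tweets) so that performance can be compared across background corpus, window size, and dimensionality.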
Compressing Word Embeddings
Recent methods for learning vector space representations of words have
succeeded in capturing fine-grained semantic and syntactic regularities using
vector arithmetic. However, these vector space representations (created through
large-scale text analysis) are typically stored verbatim, since their internal
structure is opaque. Using word-analogy tests to monitor the level of detail
stored in compressed re-representations of the same vector space, the
trade-offs between the reduction in memory usage and expressiveness are
investigated. A simple scheme is outlined that can reduce the memory footprint
of a state-of-the-art embedding by a factor of 10, with only minimal impact on
performance. Then, using the same `bit budget', a binary (approximate)
factorisation of the same space is also explored, with the aim of creating an
equivalent representation with better interpretability.
Comment: 10 pages, 0 figures, submitted to ICONIP-2016. Previous experimental
results were submitted to ICLR-2016, but the paper has been significantly
updated, since a new experimental set-up worked much better.
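For intuition on the memory/expressiveness trade-off, the sketch below applies a generic uniform 8-bit quantization to an embedding matrix and re-runs a toy analogy query on the reconstructed vectors. This is an illustrative stand-in for monitoring degradation with analogy tests, not the paper's specific compression or binary factorisation scheme; the vocabulary and embeddings are dummy data.

```python
# Minimal sketch of one generic memory/fidelity trade-off (uniform 8-bit quantization),
# not the paper's specific compression scheme.
import numpy as np

def quantize_uint8(E):
    """Quantize each embedding dimension to 8 bits; return codes plus scale/offset."""
    lo, hi = E.min(axis=0), E.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0                      # guard against constant columns
    codes = np.round((E - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

def analogy(E, vocab, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the query words."""
    idx = {w: i for i, w in enumerate(vocab)}
    q = E[idx[b]] - E[idx[a]] + E[idx[c]]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)
    for w in (a, b, c):
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

if __name__ == "__main__":
    vocab = ["king", "queen", "man", "woman"]                 # toy vocabulary
    E = np.random.randn(len(vocab), 300).astype(np.float32)   # stand-in embeddings
    codes, scale, lo = quantize_uint8(E)
    E_hat = dequantize(codes, scale, lo)
    print("uint8 bytes vs float32 bytes:", codes.nbytes, "vs", E.nbytes)
    print("analogy (original):  ", analogy(E, vocab, "man", "king", "woman"))
    print("analogy (compressed):", analogy(E_hat, vocab, "man", "king", "woman"))
```

Running the same analogy benchmark before and after compression, as the abstract describes, shows how much expressiveness survives a given bit budget; heavier schemes (product quantization, binary factorisation) trade more memory for more careful reconstruction.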