Density Matching for Bilingual Word Embedding
Recent approaches to cross-lingual word embedding have generally been based
on linear transformations between the sets of embedding vectors in the two
languages. In this paper, we propose an approach that instead expresses the two
monolingual embedding spaces as probability densities defined by a Gaussian
mixture model, and matches the two densities using a method called normalizing
flow. The method requires no explicit supervision, and can be learned with only
a seed dictionary of words that have identical strings. We argue that this
formulation has several intuitively attractive properties, particularly with
respect to improving robustness and generalization to mappings between
difficult language pairs or word pairs. On a benchmark data set of bilingual
lexicon induction and cross-lingual word similarity, our approach can achieve
competitive or superior performance compared to state-of-the-art published
results, with particularly strong gains on etymologically distant and/or
morphologically rich languages. Comment: Accepted by NAACL-HLT 2019
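The core recipe lends itself to a compact illustration. Below is a minimal sketch, assuming pre-trained embedding matrices and substituting a single invertible linear map for the paper's normalizing flow; the Gaussian-mixture initialization and all hyper-parameters are placeholders, not the authors' setup.

```python
# Minimal sketch of the density-matching idea: fit a Gaussian mixture over the
# target space and train an invertible map on the source side to maximize the
# target-density likelihood via the change-of-variables formula.
import torch
from torch.distributions import Categorical, MixtureSameFamily, MultivariateNormal

d, k = 300, 10
src = torch.randn(5000, d)   # stand-in source embeddings
tgt = torch.randn(5000, d)   # stand-in target embeddings

# Gaussian mixture over the target space (means crudely picked from target points).
means = tgt[torch.randperm(len(tgt))[:k]]
gmm = MixtureSameFamily(
    Categorical(logits=torch.zeros(k)),
    MultivariateNormal(means, covariance_matrix=torch.eye(d).expand(k, d, d)),
)

W = torch.nn.Parameter(torch.eye(d))   # invertible linear "flow"
opt = torch.optim.Adam([W], lr=1e-3)
for step in range(200):
    z = src @ W.T                      # push source embeddings through the map
    # Change of variables: log p(x) = log p_tgt(Wx) + log|det W|
    loss = -(gmm.log_prob(z) + torch.linalg.slogdet(W).logabsdet).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```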
CLUSE: Cross-Lingual Unsupervised Sense Embeddings
This paper proposes a modularized sense induction and representation learning
model that jointly learns bilingual sense embeddings that align well in the
vector space, where the cross-lingual signal in the English-Chinese parallel
corpus is exploited to capture the collocation and distributed characteristics
in the language pair. The model is evaluated on the Stanford Contextual Word
Similarity (SCWS) dataset to ensure the quality of monolingual sense
embeddings. In addition, we introduce Bilingual Contextual Word Similarity
(BCWS), a large and high-quality dataset for evaluating cross-lingual sense
embeddings, which is the first attempt to measure whether the learned
embeddings are indeed aligned well in the vector space. The proposed approach
shows superior quality of the sense embeddings when evaluated in both
monolingual and bilingual spaces. Comment: 11 pages, accepted by EMNLP 2018
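The sense-induction half of such models usually reduces to picking, for each token occurrence, the sense vector that best matches the context. The snippet below is a generic sketch of that selection step, not CLUSE's actual modular architecture; all sizes and names are illustrative assumptions.

```python
# Generic sketch of context-driven sense selection (the common core of sense
# induction models); not CLUSE's architecture, sizes are illustrative.
import torch

V, S, d = 10000, 3, 128                       # vocab, senses per word, dim
sense_emb = torch.nn.Embedding(V * S, d)      # S sense vectors per word
ctx_emb = torch.nn.Embedding(V, d)            # one context vector per word

def select_sense(word_id: int, context_ids: torch.Tensor) -> int:
    """Return the index of the sense of `word_id` closest to its context."""
    ctx = ctx_emb(context_ids).mean(dim=0, keepdim=True)          # (1, d)
    senses = sense_emb.weight[word_id * S:(word_id + 1) * S]      # (S, d)
    return word_id * S + int(torch.cosine_similarity(senses, ctx).argmax())
```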
Learning Unsupervised Word Mapping by Maximizing Mean Discrepancy
Cross-lingual word embeddings aim to capture common linguistic regularities
of different languages, which benefit various downstream tasks ranging from
machine translation to transfer learning. Recently, it has been shown that
these embeddings can be effectively learned by aligning two disjoint
monolingual vector spaces through a linear transformation (word mapping). In
this work, we focus on learning such a word mapping without any supervision
signal. Most previous work on this task adopts parametric metrics to measure
distribution differences, which typically requires a sophisticated alternating
optimization process, either in the form of a min-max game or intermediate
density estimation. Such alternating optimization is hard to tune and unstable.
To avoid this sophisticated alternating optimization,
we propose to learn unsupervised word mapping by directly maximizing the mean
discrepancy between the distributions of the transferred embeddings and the
target embeddings. Extensive experimental results show that our proposed model
outperforms competitive baselines by a large margin.
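As a rough illustration of a mean-discrepancy objective, the sketch below computes a biased estimate of squared MMD with a Gaussian kernel between the mapped source embeddings and the target embeddings, and uses it as the training signal for a linear map W. The kernel, bandwidth, update direction, and all hyper-parameters are assumptions, not the paper's exact objective.

```python
# Sketch: kernel mean discrepancy between mapped source and target embeddings,
# used as a training signal for the linear map W. Kernel choice, bandwidth,
# and the decision to minimize the estimate here are illustrative assumptions.
import torch

def mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased estimate of squared MMD with a Gaussian (RBF) kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

d = 300
src, tgt = torch.randn(2000, d), torch.randn(2000, d)
W = torch.nn.Parameter(torch.eye(d))
opt = torch.optim.Adam([W], lr=1e-4)
for step in range(100):
    loss = mmd2(src @ W.T, tgt)     # pull the two distributions together
    opt.zero_grad(); loss.backward(); opt.step()
```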
BattRAE: Bidimensional Attention-Based Recursive Autoencoders for Learning Bilingual Phrase Embeddings
In this paper, we propose a bidimensional attention based recursive
autoencoder (BattRAE) to integrate clues and source-target interactions at
multiple levels of granularity into bilingual phrase representations. We employ
recursive autoencoders to generate tree structures of phrases with embeddings
at different levels of granularity (e.g., words, sub-phrases and phrases). Over
these embeddings on the source and target side, we introduce a bidimensional
attention network to learn their interactions encoded in a bidimensional
attention matrix, from which we extract two soft attention weight distributions
simultaneously. These weight distributions enable BattRAE to generate
composite phrase representations via convolution. Based on the learned phrase
representations, we further use a bilinear neural model, trained via a
max-margin method, to measure bilingual semantic similarity. To evaluate the
effectiveness of BattRAE, we incorporate this semantic similarity as an
additional feature into a state-of-the-art SMT system. Extensive experiments on
NIST Chinese-English test sets show that our model achieves a substantial
improvement of up to 1.63 BLEU points on average over the baseline. Comment: 7 pages, accepted by AAAI 2017
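The bidimensional attention mechanism itself is easy to picture. Below is a hedged sketch: a matrix of interaction scores between source-side and target-side node embeddings is pooled along each axis to yield one attention distribution per side; the bilinear scoring form is an assumption, and weighted sums stand in for the paper's convolutional composition.

```python
# Sketch of a bidimensional attention matrix over source/target phrase nodes;
# weighted sums stand in for BattRAE's convolutional composition, and the
# bilinear scoring form is an assumption.
import torch

d, m, n = 128, 7, 9                        # dim, #source nodes, #target nodes
S = torch.randn(m, d)                      # embeddings of words/sub-phrases/phrase
T = torch.randn(n, d)
M = torch.nn.Parameter(torch.randn(d, d) * 0.01)

A = torch.tanh(S @ M @ T.T)                # (m, n) bidimensional attention matrix
src_weights = torch.softmax(A.sum(dim=1), dim=0)   # distribution over source nodes
tgt_weights = torch.softmax(A.sum(dim=0), dim=0)   # distribution over target nodes
src_repr, tgt_repr = src_weights @ S, tgt_weights @ T   # attended representations
```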
Unsupervised Cross-lingual Transfer of Word Embedding Spaces
Cross-lingual transfer of word embeddings aims to establish the semantic
mappings among words in different languages by learning the transformation
functions over the corresponding word embedding spaces. Successfully solving
this problem would benefit many downstream tasks, such as translating text
classification models from resource-rich languages (e.g. English) to
low-resource languages. Supervised methods for this problem rely on the
availability of cross-lingual supervision, either using parallel corpora or
bilingual lexicons as the labeled data for training, which may not be available
for many low-resource languages. This paper proposes an unsupervised learning
approach that does not require any cross-lingual labeled data. Given two
monolingual word embedding spaces for any language pair, our algorithm
optimizes the transformation functions in both directions simultaneously based
on distributional matching as well as minimizing the back-translation losses.
We use a neural network implementation to calculate the Sinkhorn distance, a
well-defined distributional similarity measure, and optimize our objective
through back-propagation. Our evaluation on benchmark datasets for bilingual
lexicon induction and cross-lingual word similarity prediction shows stronger
or competitive performance of the proposed method compared to other
state-of-the-art supervised and unsupervised baseline methods over many
language pairs. Comment: EMNLP 2018
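To make the distributional-matching piece concrete, the sketch below runs a plain Sinkhorn fixed-point iteration over a cosine cost matrix and back-propagates through it to train one direction of the mapping. The paper additionally trains both directions with back-translation losses; every hyper-parameter here is a placeholder.

```python
# Sketch: differentiable Sinkhorn distance between mapped source and target
# embeddings; one mapping direction only, hyper-parameters are placeholders.
import torch

def sinkhorn(x, y, eps=0.1, iters=50):
    """Entropy-regularized OT cost between two equally sized point clouds."""
    xn = torch.nn.functional.normalize(x, dim=1)
    yn = torch.nn.functional.normalize(y, dim=1)
    C = 1 - xn @ yn.T                             # cosine cost in [0, 2]
    K = torch.exp(-C / eps)
    n = len(x)
    a = b = torch.full((n,), 1.0 / n)             # uniform marginals
    u = torch.full((n,), 1.0 / n)
    for _ in range(iters):                        # Sinkhorn fixed-point updates
        u = a / (K @ (b / (K.T @ u)))
    v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]               # transport plan
    return (P * C).sum()

d = 300
src, tgt = torch.randn(500, d), torch.randn(500, d)
W = torch.nn.Parameter(torch.eye(d))
opt = torch.optim.Adam([W], lr=1e-4)
for step in range(100):
    loss = sinkhorn(src @ W.T, tgt)
    opt.zero_grad(); loss.backward(); opt.step()
```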
Learning Multilingual Word Representations using a Bag-of-Words Autoencoder
Recent work on learning multilingual word representations usually relies on
the use of word-level alignments (e.g. inferred with the help of GIZA++)
between translated sentences, in order to align the word embeddings in
different languages. In this workshop paper, we investigate an autoencoder
model for learning multilingual word representations that does without such
word-level alignments. The autoencoder is trained to reconstruct the
bag-of-words representation of a given sentence from an encoded representation
extracted from its translation. We evaluate our approach on a multilingual
document classification task, where labeled data is available only for one
language (e.g. English) while classification must be performed in a different
language (e.g. French). In our experiments, we observe that our method compares
favorably with a previously proposed method that exploits word-level alignments
to learn word representations. Comment: This workshop paper was accepted on
October 30, 2013 at the NIPS 2013 workshop on deep learning
(https://sites.google.com/site/deeplearningworkshopnips2013/accepted-papers)
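The training objective is simple to sketch: encode the bag-of-words of a sentence in one language and reconstruct the bag-of-words of its translation. The architecture, loss, and sizes below are illustrative assumptions, not the workshop paper's exact model.

```python
# Sketch: reconstruct the English bag-of-words of a sentence from an encoding
# of its French translation; architecture and loss are illustrative.
import torch
import torch.nn as nn

V_fr, V_en, h = 20000, 20000, 256
encoder = nn.Linear(V_fr, h)                  # encode the French bag-of-words
decoder = nn.Linear(h, V_en)                  # decode into English vocab scores
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

def step(bow_fr: torch.Tensor, bow_en: torch.Tensor) -> float:
    """One update on a batch of (bow_fr, bow_en) aligned sentence pairs."""
    logits = decoder(torch.tanh(encoder(bow_fr)))             # (batch, V_en)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, bow_en)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Word representations can afterwards be read off the columns of
# encoder.weight (French words) and rows of decoder.weight (English words).
```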
Learning Cross-lingual Embeddings from Twitter via Distant Supervision
Cross-lingual embeddings represent the meaning of words from different
languages in the same vector space. Recent work has shown that it is possible
to construct such representations by aligning independently learned monolingual
embedding spaces, and that accurate alignments can be obtained even without
external bilingual data. In this paper we explore a research direction that has
been surprisingly neglected in the literature: leveraging noisy user-generated
text to learn cross-lingual embeddings particularly tailored towards social
media applications. While the noisiness and informal nature of the social media
genre pose additional challenges to cross-lingual embedding methods, we find
that it also provides key opportunities due to the abundance of code-switching
and the existence of a shared vocabulary of emoji and named entities. Our
contribution consists of a very simple post-processing step that exploits these
phenomena to significantly improve the performance of state-of-the-art
alignment methods. Comment: Accepted to ICWSM 2020. 11 pages, 1 appendix.
Pre-trained embeddings available at
https://github.com/pedrada88/crossembeddings-twitter
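The abstract does not spell the post-processing step out, so the snippet below is only a generic sketch of the underlying idea: treat tokens whose strings are identical across the two vocabularies (emoji, named entities, code-switched words) as anchors and refit an orthogonal Procrustes map on them. Function names and the choice of Procrustes are assumptions.

```python
# Generic sketch (not the paper's exact step): use identical strings across
# the two vocabularies (emoji, named entities) as anchors for an orthogonal
# Procrustes refit of the alignment.
import numpy as np

def procrustes_from_shared(vocab_a, vocab_b, emb_a, emb_b):
    """Orthogonal W such that emb_a @ W approximates emb_b on shared tokens."""
    shared = [w for w in vocab_a if w in vocab_b]         # emoji, names, ...
    X = np.stack([emb_a[vocab_a[w]] for w in shared])     # (k, d) in space A
    Y = np.stack([emb_b[vocab_b[w]] for w in shared])     # (k, d) in space B
    U, _, Vt = np.linalg.svd(X.T @ Y)                     # Procrustes solution
    return U @ Vt                                         # minimizes ||XW - Y||_F

# Usage: aligned_a = emb_a @ procrustes_from_shared(vocab_a, vocab_b, emb_a, emb_b)
```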
Why is unsupervised alignment of English embeddings from different algorithms so hard?
This paper presents a challenge to the community: generative adversarial
networks (GANs) can perfectly align independent English word embeddings induced
using the same algorithm, based on distributional information alone, but fail
to do so for embeddings induced by two different algorithms. Why is that? We
believe understanding why is key to understanding both modern word embedding
algorithms and the limitations and instability dynamics of GANs. This paper
shows that (a) in all the cases where alignment fails, there exists a linear
transform between the two embedding spaces (so algorithm biases do not lead to
non-linear differences), and (b) similar effects cannot easily be obtained by
varying
hyper-parameters. One plausible suggestion based on our initial experiments is
that the differences in the inductive biases of the embedding algorithms lead
to an optimization landscape that is riddled with local optima, leading to a
very small basin of convergence, but we present this more as a challenge paper
than a technical contribution. Comment: Accepted at EMNLP 2018
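Claim (a) suggests a simple diagnostic, sketched below under assumptions about the data layout: fit the least-squares linear map between the two spaces on paired words and inspect the relative residual. A small residual on held-out words indicates the spaces are linearly related even where GAN-based alignment fails to find the map.

```python
# Sketch of the diagnostic behind claim (a): fit the best linear map between
# two English embedding spaces on shared words and check the residual.
import numpy as np

def linear_fit_residual(X: np.ndarray, Y: np.ndarray) -> float:
    """Relative residual of the least-squares map X -> Y (rows are paired)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)     # minimizes ||X @ W - Y||_F
    return float(np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))
```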
LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space
Most of the successful and predominant methods for bilingual lexicon
induction (BLI) are mapping-based, where a linear mapping function is learned
with the assumption that the word embedding spaces of different languages
exhibit similar geometric structures (i.e., approximately isomorphic). However,
several recent studies have criticized this simplified assumption showing that
it does not hold in general even for closely related languages. In this work,
we propose a novel semi-supervised method to learn cross-lingual word
embeddings for BLI. Our model is independent of the isomorphic assumption and
uses nonlinear mapping in the latent space of two independently trained
auto-encoders. Through extensive experiments on fifteen (15) different language
pairs (in both directions) comprising resource-rich and low-resource languages
from two different datasets, we demonstrate that our method outperforms
existing models by a good margin. Ablation studies show the importance of
different model components and the necessity of non-linear mapping. Comment: 10 pages, 1 figure
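The overall shape of such a model is easy to sketch: one autoencoder per language, with a small non-linear network mapping between the two latent spaces, trained against a seed dictionary. Everything below (sizes, losses, weighting) is an illustrative assumption, not LNMap itself.

```python
# Sketch of the general shape: two autoencoders plus a non-linear mapper
# between their latent spaces, trained with a small seed dictionary.
import torch
import torch.nn as nn

d, z = 300, 200
enc_s, dec_s = nn.Sequential(nn.Linear(d, z), nn.ReLU()), nn.Linear(z, d)
enc_t, dec_t = nn.Sequential(nn.Linear(d, z), nn.ReLU()), nn.Linear(z, d)
mapper = nn.Sequential(nn.Linear(z, z), nn.ReLU(), nn.Linear(z, z))  # non-linear

params = [p for m in (enc_s, dec_s, enc_t, dec_t, mapper) for p in m.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)
mse = nn.MSELoss()

def step(xs, xt, seed_s, seed_t) -> float:
    """xs/xt: monolingual batches; seed_s/seed_t: embeddings of seed pairs."""
    recon = mse(dec_s(enc_s(xs)), xs) + mse(dec_t(enc_t(xt)), xt)
    translate = mse(mapper(enc_s(seed_s)), enc_t(seed_t))  # latent-space match
    loss = recon + translate
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```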
Linking Tweets with Monolingual and Cross-Lingual News using Transformed Word Embeddings
Social media platforms have grown into an important medium to spread
information about an event published by the traditional media, such as news
articles. Grouping such diverse sources of information that discuss the same
topic from varied perspectives provides new insights. But the gap in word usage
between informal social media content such as tweets and carefully written
content (e.g. news articles) makes such grouping difficult. In this paper, we
propose a transformation framework to bridge the word usage gap between tweets
and online news articles across languages by leveraging their word embeddings.
Using our framework, word embeddings extracted from tweets and news articles
are aligned closer to each other across languages, thus facilitating the
identification of similarity between news articles and tweets. Experimental
results show a notable improvement over baselines for the monolingual
comparison of tweets and news articles, while new findings are reported for
the cross-lingual comparison. Comment: Presented at CICLing 2017 (18th
International Conference on Intelligent Text Processing and Computational
Linguistics). To appear in the International Journal of Computational
Linguistics and Applications (IJCLA).
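A hedged sketch of the core transformation idea: learn a linear map from the tweet embedding space into the news embedding space from words occurring in both corpora, after which tweets and articles can be compared in one space. The anchoring-on-shared-words strategy and all names are assumptions, not the paper's exact framework.

```python
# Sketch (not the paper's exact framework): learn a linear map from tweet-space
# embeddings into news-space embeddings using words shared by both corpora.
import numpy as np

def learn_transform(tweet_vecs, news_vecs, tweet_vocab, news_vocab):
    shared = [w for w in tweet_vocab if w in news_vocab]   # anchor words
    X = np.stack([tweet_vecs[tweet_vocab[w]] for w in shared])
    Y = np.stack([news_vecs[news_vocab[w]] for w in shared])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)              # least-squares map
    return W

# Transformed tweet vectors (tweet_vecs @ W) then live in the news space, so
# tweet-article similarity reduces to cosine over averaged word vectors.
```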