Limitations of Cross-Lingual Learning from Image Search
Cross-lingual representation learning is an important step in making NLP
scale to all the world's languages. Recent work on bilingual lexicon induction
suggests that it is possible to learn cross-lingual representations of words
based on similarities between images associated with these words. However, that
work focused on the translation of selected nouns only. In our work, we
investigate whether the meaning of other parts-of-speech, in particular
adjectives and verbs, can be learned in the same way. We also experiment with
combining the representations learned from visual data with embeddings learned
from textual data. Our experiments across five language pairs indicate that
previous work does not scale to the problem of learning cross-lingual
representations beyond simple nouns.
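A minimal sketch of the general image-based lexicon-induction setup discussed above: each word is represented by the averaged, L2-normalised features of its associated images, and a translation is chosen by nearest-neighbour cosine similarity. The word lists and random feature matrices are hypothetical placeholders standing in for real image-search features (e.g. CNN activations); this is not the authors' implementation.

```python
# Illustrative sketch of bilingual lexicon induction from visual features,
# assuming each word already has a set of image feature vectors.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 5 source-language and 5 target-language words, each
# represented by 10 image feature vectors of dimension 128 (random stand-ins).
src_words = ["dog", "house", "red", "run", "small"]
tgt_words = ["Hund", "Haus", "rot", "laufen", "klein"]
src_images = {w: rng.normal(size=(10, 128)) for w in src_words}
tgt_images = {w: rng.normal(size=(10, 128)) for w in tgt_words}

def visual_vector(image_feats: np.ndarray) -> np.ndarray:
    """Aggregate a word's image features into one L2-normalised vector."""
    v = image_feats.mean(axis=0)
    return v / np.linalg.norm(v)

def translate(word: str) -> str:
    """Return the target word whose visual vector is most similar (cosine)."""
    src_vec = visual_vector(src_images[word])
    sims = {t: float(src_vec @ visual_vector(f)) for t, f in tgt_images.items()}
    return max(sims, key=sims.get)

for w in src_words:
    print(w, "->", translate(w))
```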
Cross-lingual and cross-domain discourse segmentation of entire documents
Discourse segmentation is a crucial step in building end-to-end discourse
parsers. However, discourse segmenters only exist for a few languages and
domains. Typically they only detect intra-sentential segment boundaries,
assuming gold standard sentence and token segmentation, and relying on
high-quality syntactic parses and rich heuristics that are not generally
available across languages and domains. In this paper, we propose statistical
discourse segmenters for five languages and three domains that do not rely on
gold pre-annotations. We also consider the problem of learning discourse
segmenters when no labeled data is available for a language. Our fully
supervised system obtains 89.5% F1 for English newswire, with slight drops in
performance on other domains, and we report supervised and unsupervised
(cross-lingual) results for five languages in total.
Comment: To appear in Proceedings of ACL 201
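To illustrate one way such a segmenter can be framed without gold pre-annotations, the sketch below casts discourse segmentation as per-token boundary classification over surface features only. The toy sentences, labels, and feature set are assumptions for illustration, not the paper's actual system or feature templates.

```python
# Minimal sketch: discourse segmentation as per-token boundary classification
# using only surface features (no parses, no gold sentence segmentation).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    """Simple surface features for token i."""
    return {
        "word": tokens[i].lower(),
        "is_punct": tokens[i] in ",.;:!?",
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

# Toy training data: a token is labelled 1 if it begins a discourse segment.
train_tokens = "I left early , because the talk was boring .".split()
train_labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

X = [token_features(train_tokens, i) for i in range(len(train_tokens))]
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, train_labels)

test_tokens = "She stayed home , although it was sunny .".split()
pred = model.predict([token_features(test_tokens, i) for i in range(len(test_tokens))])
print(list(zip(test_tokens, pred)))
```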
Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings
The notions of concreteness and imageability, traditionally important in
psycholinguistics, are gaining significance in semantic-oriented natural
language processing tasks. In this paper we investigate the predictability of
these two concepts via supervised learning, using word embeddings as
explanatory variables. We perform predictions both within and across languages
by exploiting collections of cross-lingual embeddings aligned to a single
vector space. We show that the notions of concreteness and imageability are
highly predictable both within and across languages, with a moderate loss of up
to 20% in correlation when predicting across languages. We further show that
cross-lingual transfer via word embeddings is more effective than simple
transfer via bilingual dictionaries.
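A minimal sketch of this setup under stated assumptions: a regression model is fit from word embeddings to concreteness ratings in one language and then applied to embeddings of another language that live in the same aligned vector space. The random vectors and ratings below are placeholders for real aligned embeddings and gold norms, and Ridge regression is just one possible choice of learner.

```python
# Sketch: supervised prediction of concreteness from word embeddings, with
# cross-lingual transfer through embeddings aligned to a shared space.
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
dim, n_train, n_test = 300, 500, 100

# Hypothetical aligned embeddings: source-language words with gold
# concreteness ratings, and target-language words (same shared space).
en_emb = rng.normal(size=(n_train, dim))
en_concreteness = rng.uniform(1.0, 5.0, size=n_train)   # e.g. a 1-5 rating scale
xx_emb = rng.normal(size=(n_test, dim))                  # target-language words
xx_gold = rng.uniform(1.0, 5.0, size=n_test)             # held-out gold ratings

# Train within one language, predict across languages via the shared space.
model = Ridge(alpha=1.0).fit(en_emb, en_concreteness)
pred = model.predict(xx_emb)
rho, _ = spearmanr(pred, xx_gold)
print(f"cross-lingual Spearman correlation: {rho:.3f}")
```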