Limitations of Cross-Lingual Learning from Image Search
Cross-lingual representation learning is an important step in making NLP
scale to all the world's languages. Recent work on bilingual lexicon induction
suggests that it is possible to learn cross-lingual representations of words
based on similarities between images associated with these words. However, that
work focused on the translation of selected nouns only. In our work, we
investigate whether the meaning of other parts-of-speech, in particular
adjectives and verbs, can be learned in the same way. We also experiment with
combining the representations learned from visual data with embeddings learned
from textual data. Our experiments across five language pairs indicate that
previous work does not scale to the problem of learning cross-lingual
representations beyond simple nouns.
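The image-based induction idea examined above can be sketched as follows: each word is represented by the average of the feature vectors of its associated images, and a translation is picked by nearest-neighbour search under cosine similarity. This is a minimal illustration with made-up two-dimensional "features" and hypothetical function names; real systems use CNN activations over retrieved image sets.

```python
import numpy as np

def word_signature(image_feats):
    # Average the feature vectors of the images retrieved for a word.
    return np.asarray(image_feats, dtype=float).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def induce_lexicon(src_vecs, tgt_vecs):
    # Translate each source word to the target word whose aggregated
    # image signature is most similar.
    return {s: max(tgt_vecs, key=lambda t: cosine(v, tgt_vecs[t]))
            for s, v in src_vecs.items()}

# Toy 2-d "image features" (illustrative values only).
src = {"dog": word_signature([[1.0, 0.1], [0.9, 0.0]]),
       "cat": word_signature([[0.1, 1.0], [0.0, 0.9]])}
tgt = {"hund": word_signature([[0.95, 0.05], [1.0, 0.1]]),
       "katze": word_signature([[0.05, 0.95], [0.1, 1.0]])}
print(induce_lexicon(src, tgt))  # {'dog': 'hund', 'cat': 'katze'}
```

The abstract's finding is that this pipeline works for concrete nouns but degrades for adjectives and verbs, whose associated images are far less visually consistent.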
Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data
Bilingual lexicon induction, translating words from the source language to
the target language, is a long-standing natural language processing task.
Recent work has shown that it is promising to employ images as pivots to learn
lexicon induction without relying on parallel corpora. However, these
vision-based approaches simply associate words with entire images, which
constrains them to translating concrete words and requires object-centered
images. Humans understand words better when they appear in the context of a
sentence. Therefore, in this paper, we propose to utilize images and their
associated captions to address the limitations of previous approaches. We
propose a multi-lingual caption model trained with different mono-lingual
multimodal data to map words in different languages into joint spaces. Two
types of word representation are induced from the multi-lingual caption model:
linguistic features and localized visual features. The linguistic feature is
learned from sentence contexts under visual-semantic constraints, which is
beneficial for translating words that are less visually relevant. The
localized visual feature attends to the image region that correlates with the
word, which relaxes the requirement that images contain a single salient
object. The two types of features are complementary for word
translation. Experimental results on multiple language pairs demonstrate the
effectiveness of our proposed method, which substantially outperforms previous
vision-based approaches without using any parallel sentences or supervision of
seed word pairs.
Comment: Accepted by AAAI 201
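The complementarity of the two feature types can be illustrated with a weighted similarity score. The interpolation weight `alpha` and all vectors below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def translation_score(src, tgt, alpha=0.5):
    # Interpolate linguistic and localized-visual similarity;
    # alpha is a hypothetical mixing weight.
    return (alpha * cosine(src["ling"], tgt["ling"])
            + (1 - alpha) * cosine(src["vis"], tgt["vis"]))

# A less visually relevant word: the linguistic contexts still discriminate
# the correct translation even when visual features are weak (toy values).
src_word = {"ling": [0.9, 0.1, 0.2], "vis": [0.3, 0.3, 0.3]}
good_tgt = {"ling": [0.8, 0.2, 0.1], "vis": [0.2, 0.4, 0.3]}
bad_tgt  = {"ling": [0.1, 0.9, 0.3], "vis": [0.3, 0.2, 0.4]}
print(translation_score(src_word, good_tgt) > translation_score(src_word, bad_tgt))  # True
```

The design choice here mirrors the abstract's claim: when the visual signal is uninformative, the linguistic term keeps the ranking correct, and vice versa.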
Visual Pivoting for (Unsupervised) Entity Alignment
This work studies the use of visual semantic representations to align
entities in heterogeneous knowledge graphs (KGs). Images are natural components
of many existing KGs. By combining visual knowledge with other auxiliary
information, we show that the proposed new approach, EVA, creates a holistic
entity representation that provides strong signals for cross-graph entity
alignment. Moreover, previous entity alignment methods require human-labelled
seed alignments, which restricts their applicability. EVA provides a completely
unsupervised solution by leveraging the visual similarity of entities to create
an initial seed dictionary (visual pivots). Experiments on benchmark data sets
DBP15k and DWY15k show that EVA offers state-of-the-art performance on both
monolingual and cross-lingual entity alignment tasks. Furthermore, we discover
that images are particularly useful to align long-tail KG entities, which
inherently lack the structural contexts necessary for capturing the
correspondences.
Comment: To appear at AAAI-202
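One way to bootstrap such a seed dictionary from images alone is to keep only mutual nearest neighbours under cosine similarity of entity image embeddings. This is a hedged sketch of the idea with toy embeddings; EVA's actual seeding procedure may differ in details:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def visual_pivots(src_emb, tgt_emb):
    # Keep (source, target) entity pairs that are mutual nearest
    # neighbours in visual embedding space; these unsupervised
    # "visual pivots" serve as the initial seed dictionary.
    nn_st = {s: max(tgt_emb, key=lambda t: cosine(v, tgt_emb[t]))
             for s, v in src_emb.items()}
    nn_ts = {t: max(src_emb, key=lambda s: cosine(v, src_emb[s]))
             for t, v in tgt_emb.items()}
    return {s: t for s, t in nn_st.items() if nn_ts[t] == s}

# Toy 2-d image embeddings for KG entities (illustrative values only).
src = {"Berlin": np.array([1.0, 0.0]), "Paris": np.array([0.0, 1.0])}
tgt = {"Berlin_de": np.array([0.9, 0.1]), "Paris_fr": np.array([0.1, 0.9])}
print(visual_pivots(src, tgt))  # {'Berlin': 'Berlin_de', 'Paris': 'Paris_fr'}
```

The mutual-nearest-neighbour filter trades recall for precision, which is the sensible trade-off for a seed dictionary that later alignment iterations will expand.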
Bootstrapping Disjoint Datasets for Multilingual Multimodal Representation Learning
Recent work has highlighted the advantage of jointly learning grounded
sentence representations from multiple languages. However, the data used in
these studies has been limited to an aligned scenario: the same images
annotated with sentences in multiple languages. We focus on the more realistic
disjoint scenario in which there is no overlap between the images in
multilingual image--caption datasets. We confirm that training with aligned
data results in better grounded sentence representations than training with
disjoint data, as measured by image--sentence retrieval performance. In order
to close this gap in performance, we propose a pseudopairing method to generate
synthetically aligned English--German--image triplets from the disjoint sets.
The method works by first training a model on the disjoint data, and then
creating new triplets across datasets using sentence similarity under the
learned model. Experiments show that pseudopairs improve image--sentence
retrieval performance compared to disjoint training, despite requiring no
external data or models. However, we do find that using an external machine
translation model to generate the synthetic data sets results in better
performance.
Comment: 10 page
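The pseudopairing step described above can be sketched as follows: captions from the disjoint English and German datasets are embedded with the model trained on disjoint data, and each German caption inherits the image of its most similar English caption, yielding a synthetic English-German-image triplet. The `embed` function below is a stand-in lookup table for the learned encoder, and all captions and embeddings are made-up values:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pseudopairs(en_caps, de_caps, embed):
    # For each German caption, find the most similar English caption under
    # the learned sentence encoder and inherit that caption's image,
    # producing a synthetic (image, English, German) triplet.
    triplets = []
    for de in de_caps:
        best = max(en_caps, key=lambda en: cosine(embed(en["text"]), embed(de)))
        triplets.append((best["image"], best["text"], de))
    return triplets

# Stand-in for the learned multilingual encoder: a fixed table of toy
# 2-d sentence embeddings (a real system uses the trained model).
def embed(text):
    table = {"a dog runs": [1.0, 0.0], "a red car": [0.0, 1.0],
             "ein hund rennt": [0.9, 0.1], "ein rotes auto": [0.1, 0.9]}
    return np.asarray(table[text])

en_caps = [{"image": "img1.jpg", "text": "a dog runs"},
           {"image": "img2.jpg", "text": "a red car"}]
de_caps = ["ein hund rennt", "ein rotes auto"]
print(pseudopairs(en_caps, de_caps, embed))
```

Because the pairing model is itself trained only on the disjoint data, the procedure needs no external data or models, exactly as the abstract claims.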
Visual bilingual lexicon induction with transferred ConvNet features
This paper is concerned with the task of bilingual lexicon induction using image-based features. By applying features from a convolutional neural network (CNN), we obtain state-of-the-art performance on a standard dataset, achieving a 79% relative improvement over previous work that uses bags of visual words based on SIFT features. The CNN image-based approach is also compared with state-of-the-art linguistic approaches to bilingual lexicon induction, even outperforming these for one of three language pairs on another standard dataset. Furthermore, we shed new
light on the type of visual similarity metric to use for genuine similarity versus relatedness tasks, and experiment with using multiple layers from the same network in an attempt to improve performance.
status: accepted
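The metric question raised in this last abstract can be illustrated by two common ways of aggregating per-image CNN features when comparing two words' image sets (the CNN-Mean / CNN-Max naming follows related literature; the vectors below are toy values):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cnn_mean(src_imgs, tgt_imgs):
    # Cosine between the averaged feature vectors of the two image sets:
    # rewards overall visual agreement across the sets.
    return cosine(np.mean(src_imgs, axis=0), np.mean(tgt_imgs, axis=0))

def cnn_max(src_imgs, tgt_imgs):
    # Maximum cosine over all cross-set image pairs: a single highly
    # similar pair dominates, which can favour relatedness over
    # genuine similarity.
    return max(cosine(s, t) for s in src_imgs for t in tgt_imgs)

src = np.array([[1.0, 0.0], [0.0, 1.0]])   # visually mixed image set
tgt = np.array([[1.0, 0.05], [0.9, 0.1]])  # visually consistent image set
print(cnn_mean(src, tgt), cnn_max(src, tgt))
```

In this toy case the max aggregation scores the pair much higher than the mean, because one image pair happens to match closely even though the sets disagree overall.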