STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset
In recent years, automatic generation of image descriptions (captions), that
is, image captioning, has attracted a great deal of attention. In this paper,
we focus on generating Japanese captions for images. Since most available
caption datasets have been constructed for the English language, there are
few datasets for Japanese. To tackle this problem, we construct a
large-scale Japanese image caption dataset based on images from MS-COCO, which
is called STAIR Captions. STAIR Captions consists of 820,310 Japanese captions
for 164,062 images. In experiments, we show that a neural network trained
on STAIR Captions generates more natural and higher-quality Japanese captions
than those obtained by first generating English captions and then applying
English-to-Japanese machine translation.
Comment: Accepted as an ACL 2017 short paper, 5 pages.
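For readers unfamiliar with the setup, the sketch below shows the general shape of such a neural captioning model: features from a pretrained CNN initialize an LSTM decoder trained with teacher forcing. Everything here (names, dimensions, vocabulary size) is an illustrative assumption, not the authors' code.

```python
# Minimal encoder-decoder captioning sketch (PyTorch). Illustrative only;
# the paper's exact architecture and hyperparameters may differ.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (B, feat_dim) from a pretrained CNN; captions: (B, T) token ids
        h0 = torch.tanh(self.init_h(image_feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))   # (B, T, H)
        return self.out(hidden)                                 # per-step vocab logits

# Teacher-forced training step: predict token t+1 from tokens up to t.
decoder = CaptionDecoder(vocab_size=32000)
feats = torch.randn(4, 2048)               # stand-in CNN image features
caps = torch.randint(0, 32000, (4, 12))    # stand-in Japanese token ids
logits = decoder(feats, caps[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 32000), caps[:, 1:].reshape(-1))
```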
Lessons learned in multilingual grounded language learning
Recent work has shown how to learn better visual-semantic embeddings by
leveraging image descriptions in more than one language. Here, we investigate
in detail which conditions affect the performance of this type of grounded
language learning model. We show that multilingual training improves over
bilingual training, and that low-resource languages benefit from training with
higher-resource languages. We demonstrate that a multilingual model can be
trained equally well on either translations or comparable sentence pairs, and
that annotating the same set of images in multiple languages enables further
improvements via an additional caption-caption ranking objective.
Comment: CoNLL 2018.
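A minimal sketch of the ranking objectives this abstract refers to: a bidirectional max-margin loss applied between images and captions in each language, plus the additional caption-caption loss for images annotated in both languages. The margin value, in-batch negatives, and all names are assumptions, not the paper's exact formulation.

```python
# Max-margin ranking losses for a multilingual visual-semantic embedding.
# Sketch only: margin and negative sampling are assumed, not from the paper.
import torch
import torch.nn.functional as F

def ranking_loss(a, b, margin=0.2):
    """Bidirectional max-margin loss over in-batch negatives.
    a, b: (B, D) L2-normalized embeddings; (a[i], b[i]) are true pairs."""
    scores = a @ b.t()                                 # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)                   # true-pair similarities
    cost_a = (margin + scores - pos).clamp(min=0)      # rank b's against each a[i]
    cost_b = (margin + scores - pos.t()).clamp(min=0)  # rank a's against each b[j]
    off_diag = ~torch.eye(scores.size(0), dtype=torch.bool)
    return cost_a[off_diag].mean() + cost_b[off_diag].mean()

B, D = 8, 1024
img = F.normalize(torch.randn(B, D), dim=1)      # image embeddings
cap_en = F.normalize(torch.randn(B, D), dim=1)   # English caption embeddings
cap_de = F.normalize(torch.randn(B, D), dim=1)   # German caption embeddings
# Image-caption ranking per language, plus the caption-caption objective
# for images that carry captions in both languages.
loss = (ranking_loss(img, cap_en) + ranking_loss(img, cap_de)
        + ranking_loss(cap_en, cap_de))
```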
Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data
Bilingual lexicon induction, translating words from the source language to
the target language, is a long-standing natural language processing task.
Recent work has shown that it is promising to employ images as a pivot to
learn lexicon induction without relying on parallel corpora. However, these
vision-based approaches simply associate words with entire images, so they are
limited to translating concrete words and require object-centered images.
Humans understand words better when they appear in sentences with
context. Therefore, in this paper, we propose to utilize images and their
associated captions to address the limitations of previous approaches. We
propose a multi-lingual caption model trained with different mono-lingual
multimodal data to map words in different languages into joint spaces. Two
types of word representations are induced from the multi-lingual caption model:
linguistic features and localized visual features. The linguistic features are
learned from sentence contexts under visual-semantic constraints, which helps
translate words that are less visually relevant. The localized visual features
attend to the image regions that correlate with a word, alleviating the
restriction to object-centered images for salient visual representations. The
two types of features are complementary for word
translation. Experimental results on multiple language pairs demonstrate the
effectiveness of our proposed method, which substantially outperforms previous
vision-based approaches without using any parallel sentences or supervision of
seed word pairs.
Comment: Accepted by AAAI 2019.
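The induction step implied above can be sketched as nearest-neighbor retrieval over a combination of the two similarity signals, assuming word vectors from both languages have already been mapped into joint spaces by the caption model. The interpolation weight and all names below are illustrative assumptions.

```python
# Nearest-neighbor lexicon induction from two complementary feature types.
# Assumes both languages' vectors already share joint spaces; the 0.5
# interpolation weight is illustrative, not taken from the paper.
import numpy as np

def cosine_matrix(src, tgt):
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T

def induce_lexicon(src_ling, tgt_ling, src_vis, tgt_vis, alpha=0.5):
    """Rank target words for each source word by interpolating
    linguistic-feature and localized-visual-feature similarity."""
    sim = (alpha * cosine_matrix(src_ling, tgt_ling)
           + (1 - alpha) * cosine_matrix(src_vis, tgt_vis))
    return np.argsort(-sim, axis=1)   # target word indices, best first

rng = np.random.default_rng(0)
src_ling, tgt_ling = rng.normal(size=(100, 300)), rng.normal(size=(500, 300))
src_vis, tgt_vis = rng.normal(size=(100, 2048)), rng.normal(size=(500, 2048))
ranks = induce_lexicon(src_ling, tgt_ling, src_vis, tgt_vis)
print(ranks[0, :5])   # top-5 candidate translations for source word 0
```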