STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset
In recent years, automatic generation of image descriptions (captions), that
is, image captioning, has attracted a great deal of attention. In this paper,
we particularly consider generating Japanese captions for images. Since most
available caption datasets have been constructed for the English language, there
are few datasets for Japanese. To tackle this problem, we construct a
large-scale Japanese image caption dataset based on images from MS-COCO, which
is called STAIR Captions. STAIR Captions consists of 820,310 Japanese captions
for 164,062 images. In the experiment, we show that a neural network trained
using STAIR Captions can generate more natural and better Japanese captions,
compared to those generated using English-Japanese machine translation after
generating English captions.
Comment: Accepted as ACL2017 short paper. 5 pages
Unsupervised Cross-lingual Image Captioning
Most recent image captioning works are conducted in English as the majority
of image-caption datasets are in English. However, there are a large number of
non-native English speakers worldwide. Generating image captions in different
languages is worth exploring. In this paper, we present a novel unsupervised
method to generate image captions without using any caption corpus. Our method
relies on 1) a cross-lingual auto-encoding, which learns the scene graph
mapping function along with the scene graph encoders and sentence decoders on
machine translation parallel corpora, and 2) an unsupervised feature mapping,
which seeks to map the encoded scene graph features from image modality to
sentence modality. By leveraging cross-lingual auto-encoding, cross-modal
feature mapping, and adversarial learning, our method can learn an image
captioner to generate captions in different languages. We verify the
effectiveness of our proposed method on Chinese image caption generation.
The comparisons against several baseline methods demonstrate the effectiveness
of our approach.
Comment: 8 pages
Lessons learned in multilingual grounded language learning
Recent work has shown how to learn better visual-semantic embeddings by
leveraging image descriptions in more than one language. Here, we investigate
in detail which conditions affect the performance of this type of grounded
language learning model. We show that multilingual training improves over
bilingual training, and that low-resource languages benefit from training with
higher-resource languages. We demonstrate that a multilingual model can be
trained equally well on either translations or comparable sentence pairs, and
that annotating the same set of images in multiple languages enables further
improvements via an additional caption-caption ranking objective.
Comment: CoNLL 201
XL-NBT: A Cross-lingual Neural Belief Tracking Framework
Task-oriented dialog systems are becoming pervasive, and many companies
heavily rely on them to complement human agents for customer service in call
centers. With globalization, the need for providing cross-lingual customer
support becomes more urgent than ever. However, cross-lingual support poses
great challenges: it requires a large amount of additional annotated data from
native speakers. In order to bypass the expensive human annotation and achieve
the first step towards the ultimate goal of building a universal dialog system,
we set out to build a cross-lingual state tracking framework. Specifically, we
assume that there exists a source language with dialog belief tracking
annotations while the target languages have no annotated dialog data of any
form. Then, we pre-train a state tracker for the source language as a teacher,
which is able to exploit easy-to-access parallel data. We then distill and
transfer its own knowledge to the student state tracker in target languages. We
specifically discuss two types of common parallel resources: bilingual corpus
and bilingual dictionary, and design different transfer learning strategies
accordingly. Experimentally, we successfully use an English state tracker as the
teacher to transfer its knowledge to both Italian and German trackers and
achieve promising results.
Comment: 13 pages, 5 figures, 3 tables, accepted to EMNLP 2018 conference
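The teacher-student transfer described in this abstract can be illustrated with a generic distillation loss. This is a minimal sketch, not the XL-NBT authors' implementation: it assumes the teacher scores a source-language utterance, the student scores the aligned target-language utterance, and both produce logits over the same hypothetical set of slot values. The function names and the temperature parameter are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft labels and the student's
    predictions. Minimizing it pushes the student toward the teacher's
    distribution; no target-language annotation is involved."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))


# Hypothetical scores over three slot values for one aligned sentence pair.
teacher = [2.0, 0.0, -1.0]   # teacher on the source-language sentence
agreeing = [2.0, 0.0, -1.0]  # student that matches the teacher
opposing = [-1.0, 0.0, 2.0]  # student that disagrees with the teacher

print(distillation_loss(teacher, agreeing))
print(distillation_loss(teacher, opposing))
```

The loss is smallest when the student reproduces the teacher's distribution, which is why the first printed value is lower than the second.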
Image Pivoting for Learning Multilingual Multimodal Representations
In this paper we propose a model to learn multimodal multilingual
representations for matching images and sentences in different languages, with
the aim of advancing multilingual versions of image search and image
understanding. Our model learns a common representation for images and their
descriptions in two different languages (which need not be parallel) by
considering the image as a pivot between two languages. We introduce a new
pairwise ranking loss function which can handle both symmetric and asymmetric
similarity between the two modalities. We evaluate our models on
image-description ranking for German and English, and on semantic textual
similarity of image descriptions in English. In both cases we achieve
state-of-the-art performance.
Comment: 7 pages, EMNLP 201
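For readers unfamiliar with pairwise ranking objectives, the following is a rough sketch of a standard symmetric max-margin ranking loss over image and sentence embeddings. It is not the new loss proposed in the paper above, whose handling of asymmetric similarity is not reproduced here; the cosine similarity, the margin value, and all function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_ranking_loss(images, sentences, margin=0.1):
    """Sum of hinge penalties over a batch where images[i] and sentences[i]
    are a matching pair. Every mismatched pair should score at least
    `margin` below the matching pair, in both retrieval directions."""
    loss = 0.0
    for i in range(len(images)):
        positive = cosine(images[i], sentences[i])
        for j in range(len(images)):
            if j == i:
                continue
            # Image i ranked against a wrong sentence j.
            loss += max(0.0, margin - positive + cosine(images[i], sentences[j]))
            # Sentence i ranked against a wrong image j.
            loss += max(0.0, margin - positive + cosine(images[j], sentences[i]))
    return loss


# Toy 2-d embeddings: aligned pairs give zero loss, swapped pairs are penalized.
images = [[1.0, 0.0], [0.0, 1.0]]
print(pairwise_ranking_loss(images, [[1.0, 0.0], [0.0, 1.0]]))
print(pairwise_ranking_loss(images, [[0.0, 1.0], [1.0, 0.0]]))
```

In the multilingual setting the abstract describes, the same kind of loss could be applied once per language against the shared image embeddings, so that the image acts as the pivot.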
Japanese SimCSE Technical Report
We report the development of Japanese SimCSE, Japanese sentence embedding
models fine-tuned with SimCSE. Since there is a lack of sentence embedding
models for Japanese that can be used as a baseline in sentence embedding
research, we conducted extensive experiments on Japanese sentence embeddings
involving 24 pre-trained Japanese or multilingual language models, five
supervised datasets, and four unsupervised datasets. In this report, we provide
the detailed training setup for Japanese SimCSE and their evaluation results.
Cross-linguistic differences and similarities in image descriptions
Automatic image description systems are commonly trained and evaluated on
large image description datasets. Recently, researchers have started to collect
such datasets for languages other than English. An unexplored question is how
different these datasets are from English and, if there are any differences,
what causes them to differ. This paper provides a cross-linguistic comparison
of Dutch, English, and German image descriptions. We find that these
descriptions are similar in many respects, but the familiarity of crowd workers
with the subjects of the images has a noticeable influence on description
specificity.
Comment: Accepted for INLG 2017, Santiago de Compostela, Spain, 4-7 September,
2017. Camera-ready version. See the ACL anthology for full bibliographic
information