Video Captioning with Guidance of Multimodal Latent Topics
The topic diversity of open-domain videos leads to various vocabularies and
linguistic expressions in describing video contents, and therefore, makes the
video captioning task even more challenging. In this paper, we propose a
unified caption framework, M&M TGM, which mines multimodal topics in an
unsupervised fashion from data and guides the caption decoder with these
topics. Compared to pre-defined topics, the mined multimodal topics are more
semantically and visually coherent and can reflect the topic distribution of
videos better. We formulate topic-aware caption generation as a multi-task
learning problem, in which we add a parallel task, topic prediction, in
addition to the caption task. For the topic prediction task, we use the mined
topics as the teacher to train a student topic prediction model, which learns
to predict the latent topics from multimodal contents of videos. The topic
prediction provides intermediate supervision to the learning process. As for
the caption task, we propose a novel topic-aware decoder to generate more
accurate and detailed video descriptions with the guidance from latent topics.
The entire learning procedure is end-to-end and optimizes both tasks
simultaneously. The results from extensive experiments conducted on the MSR-VTT
and Youtube2Text datasets demonstrate the effectiveness of our proposed model.
M&M TGM not only outperforms prior state-of-the-art methods on multiple
evaluation metrics and on both benchmark datasets, but also achieves better
generalization ability. Comment: ACM Multimedia 2017
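
A minimal sketch of the multi-task objective described above, assuming the caption loss and the topic prediction loss are combined as a weighted sum; the weight lam, the padding index, and the use of a KL term against the mined teacher topic distribution are illustrative assumptions, not the authors' released code:

import torch
import torch.nn.functional as F

def multitask_loss(word_logits, word_targets, topic_logits, topic_teacher, lam=0.1):
    # Caption task: cross-entropy over the decoder's word logits,
    # shaped (batch, seq_len, vocab); padding positions are ignored.
    caption_loss = F.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)),
        word_targets.reshape(-1),
        ignore_index=0,  # assumed padding index
    )
    # Topic task: the student topic predictor matches the soft topic
    # distribution mined offline (the teacher) via KL divergence.
    topic_loss = F.kl_div(
        F.log_softmax(topic_logits, dim=-1), topic_teacher, reduction="batchmean"
    )
    # Both tasks are optimized simultaneously, end-to-end.
    return caption_loss + lam * topic_loss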
Development of a Causal Model for Improving Rural Seniors' Accessibility: Data Evidences
Seniors residing in rural areas often encounter limited accessibility to
opportunities, resources, and services. This paper introduces a model proposing
that both aging and rural residency are factors contributing to the restricted
accessibility faced by rural seniors. Leveraging data from the 2017 National
Household Travel Survey, the study examines three hypotheses pertaining to this
causal model. Multiple causal pathways emerge in the data analysis, with
mobility identified as a mediator in one of them. The study further identifies
specific challenges faced by rural seniors, such as reduced accessibility
in reaching medical services and in assisting others. These challenges stem
primarily from aging and geographic obstacles that not only diminish seniors'
willingness to travel but also prevent more members of this group from choosing
transportation modes with higher mobility. The insights gained from this study
serve as a foundation for devising effective methods to enhance transportation
accessibility for seniors in rural areas. Comment: 12 pages, 5 tables
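
A hedged sketch of the kind of mediation test implied above, with mobility as a mediator between rural residency / age and accessibility; the variable names and the synthetic data are purely illustrative and do not reflect the paper's actual NHTS variables:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
rural = rng.integers(0, 2, n).astype(float)   # 1 = rural residency (synthetic)
age = rng.normal(70, 8, n)                    # senior ages (synthetic)
mobility = 5 - 1.2 * rural - 0.05 * age + rng.normal(0, 1, n)
access = 2 + 0.8 * mobility - 0.5 * rural + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([rural, age]))
total = sm.OLS(access, X).fit()               # total effect of exposures
med = sm.OLS(mobility, X).fit()               # path a: exposures -> mediator
Xm = sm.add_constant(np.column_stack([rural, age, mobility]))
direct = sm.OLS(access, Xm).fit()             # path b plus direct effect
print(total.params, med.params, direct.params)

If the exposures' coefficients shrink once mobility enters the model while mobility itself is significant, that pattern is consistent with mobility mediating part of the effect.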
Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data
Bilingual lexicon induction, translating words from the source language to
the target language, is a long-standing natural language processing task.
Recent endeavors show that it is promising to employ images as a pivot for
lexicon induction without relying on parallel corpora. However, these
vision-based approaches simply associate words with entire images, so they are
constrained to translating concrete words and require object-centered images.
Humans, however, understand words better when they appear in a sentence with
context. Therefore, in this paper, we propose to utilize images and their
associated captions to address the limitations of previous approaches. We
propose a multi-lingual caption model trained with different mono-lingual
multimodal data to map words in different languages into joint spaces. Two
types of word representation are induced from the multi-lingual caption model:
linguistic features and localized visual features. The linguistic feature is
learned from the sentence contexts with visual semantic constraints, which is
beneficial to learn translation for words that are less visual-relevant. The
localized visual feature is attended to the region in the image that correlates
to the word, so that it alleviates the image restriction for salient visual
representation. The two types of features are complementary for word
translation. Experimental results on multiple language pairs demonstrate the
effectiveness of our proposed method, which substantially outperforms previous
vision-based approaches without using any parallel sentences or supervision of
seed word pairs. Comment: Accepted by AAAI 2019
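
A minimal sketch of how such complementary features could be fused at retrieval time, assuming pre-computed linguistic and localized visual embedding matrices for the source and target vocabularies; the fusion weight alpha and all array names are hypothetical:

import numpy as np

def translate(src_ling, src_vis, tgt_ling, tgt_vis, alpha=0.5, topk=5):
    # Cosine similarity between all source and target word embeddings.
    def cosine(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return a @ b.T
    # Fuse the two complementary similarity matrices with a fixed weight.
    sim = alpha * cosine(src_ling, tgt_ling) + (1 - alpha) * cosine(src_vis, tgt_vis)
    # For each source word, return the indices of its top-k target candidates.
    return np.argsort(-sim, axis=1)[:, :topk]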
Explore and Tell: Embodied Visual Captioning in 3D Environments
While current visual captioning models have achieved impressive performance,
they often assume that the image is well-captured and provides a complete view
of the scene. In real-world scenarios, however, a single image may not offer a
good viewpoint, hindering fine-grained scene understanding. To overcome this
limitation, we propose a novel task called Embodied Captioning, which equips
visual captioning models with navigation capabilities, enabling them to
actively explore the scene and reduce visual ambiguity from suboptimal
viewpoints. Specifically, starting at a random viewpoint, an agent must
navigate the environment to gather information from different viewpoints and
generate a comprehensive paragraph describing all objects in the scene. To
support this task, we build the ET-Cap dataset with the Kubric simulator,
consisting of 10K 3D scenes with cluttered objects and three annotated
paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT),
which comprises a navigator and a captioner, to tackle this task. The
navigator predicts which actions to take in the environment, while the
captioner generates a paragraph description based on the whole navigation
trajectory. Extensive experiments demonstrate that our model outperforms other
carefully designed baselines. Our dataset, code, and models are available at
https://aim3-ruc.github.io/ExploreAndTell. Comment: 12 pages; 10 figures; ICCV 2023
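
A structural sketch of the cascade described above: a navigator policy acts until it decides to stop, then a captioner consumes the full trajectory. All interfaces here (env, navigator.act, captioner.describe) are illustrative assumptions, not CaBOT's actual API:

from dataclasses import dataclass, field

@dataclass
class Trajectory:
    views: list = field(default_factory=list)    # visual observations so far
    actions: list = field(default_factory=list)  # movements taken so far

def embodied_caption(env, navigator, captioner, max_steps=20):
    traj = Trajectory()
    obs = env.reset()                   # start at a random viewpoint
    for _ in range(max_steps):
        traj.views.append(obs)
        action = navigator.act(traj)    # decide the next move from the history
        if action == "STOP":
            break
        traj.actions.append(action)
        obs = env.step(action)
    # The captioner generates a paragraph conditioned on the whole trajectory,
    # not just the final view.
    return captioner.describe(traj)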