
37 research outputs found

Video Captioning with Guidance of Multimodal Latent Topics

Authors: Chen Jia, Chen Shizhe, Hauptmann Alexander, Jin Qin
Publication date: 14/02/2023

The topic diversity of open-domain videos leads to varied vocabularies and linguistic expressions in describing video content, which makes the video captioning task even more challenging. In this paper, we propose a unified caption framework, M&M TGM, which mines multimodal topics from data in an unsupervised fashion and guides the caption decoder with these topics. Compared to pre-defined topics, the mined multimodal topics are more semantically and visually coherent and better reflect the topic distribution of videos. We formulate topic-aware caption generation as a multi-task learning problem, in which we add a parallel task, topic prediction, alongside the caption task. For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from the multimodal contents of videos. The topic prediction provides intermediate supervision to the learning process. For the caption task, we propose a novel topic-aware decoder that generates more accurate and detailed video descriptions with guidance from the latent topics. The entire learning procedure is end-to-end and optimizes both tasks simultaneously. Results from extensive experiments on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of our proposed model. M&M TGM not only outperforms prior state-of-the-art methods on multiple evaluation metrics and on both benchmark datasets, but also achieves better generalization ability.

Comment: ACM Multimedia 201

arXiv.org e-Print Archive

Development of a Causal Model for Improving Rural Seniors' Accessibility: Data Evidences

Authors: Li Ke, Li Shizhe, Qin Ruwen
Publication date: 23/01/2024

Seniors residing in rural areas often encounter limited accessibility to opportunities, resources, and services. This paper introduces a model proposing that both aging and rural residency are factors contributing to the restricted accessibility faced by rural seniors. Leveraging data from the 2017 National Household Travel Survey, the study examines three hypotheses pertaining to this causal model. Multiple causal pathways emerge in the data analysis, with mobility identified as a mediator in one of them. The study further identifies specific challenges faced by rural seniors, such as reduced accessibility in reaching medical services and assisting others. These challenges stem primarily from aging and geographic obstacles, which not only diminish seniors' willingness to travel but also restrict more members of the group from choosing transportation modes with higher mobility. The insights gained from this study serve as a foundation for devising effective methods to enhance transportation accessibility for seniors in rural areas.

Comment: 12 pages 5 table

arXiv.org e-Print Archive

Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data

Authors: Chen Shizhe, Hauptmann Alexander, Jin Qin
Publication date: 02/06/2019

Bilingual lexicon induction, translating words from the source language to the target language, is a long-standing natural language processing task. Recent work shows that it is promising to employ images as a pivot to learn lexicon induction without relying on parallel corpora. However, these vision-based approaches simply associate words with entire images, so they are constrained to translating concrete words and require object-centered images. Humans understand words better when they appear within a sentence with context. Therefore, in this paper, we propose to utilize images and their associated captions to address the limitations of previous approaches. We propose a multi-lingual caption model trained with different mono-lingual multimodal data to map words in different languages into joint spaces. Two types of word representation are induced from the multi-lingual caption model: linguistic features and localized visual features. The linguistic feature is learned from the sentence contexts with visual semantic constraints, which is beneficial for learning translations of words that are less visually relevant. The localized visual feature attends to the region in the image that correlates with the word, which alleviates the image restriction for salient visual representation. The two types of features are complementary for word translation. Experimental results on multiple language pairs demonstrate the effectiveness of our proposed method, which substantially outperforms previous vision-based approaches without using any parallel sentences or supervision from seed word pairs.

Comment: Accepted by AAAI 201

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Explore and Tell: Embodied Visual Captioning in 3D Environments

Authors: Chen Shizhe, Hu Anwen, Jin Qin, Zhang Liang
Publication date: 20/08/2023

While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with the Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, code, and models are available at https://aim3-ruc.github.io/ExploreAndTell.

Comment: 12 pages; 10 figures; ICCV 202

arXiv.org e-Print Archive