Improving Image Captioning by Leveraging Knowledge Graphs
We explore the use of knowledge graphs that capture general or commonsense
knowledge to augment the information extracted from images by state-of-the-art
image captioning methods. The results of our experiments on several benchmark
data sets, such as MS COCO, measured by CIDEr-D, a standard performance metric
for image captioning, show that variants of state-of-the-art captioning methods
that make use of information extracted from knowledge graphs can substantially
outperform those that rely solely on the information extracted from images.
Comment: Accepted by WACV'1
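
To make the idea concrete, here is a minimal sketch, in PyTorch, of one way
such knowledge injection could look: vocabulary ids of commonsense neighbors
of detected objects are embedded and fused with global image features to
initialize a caption decoder. The module, the lookup scheme, and all
dimensions are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class KGCaptioner(nn.Module):
    # Hypothetical captioner: an LSTM decoder whose initial state mixes
    # image features with mean-pooled knowledge-graph term embeddings.
    def __init__(self, vocab_size=1000, img_dim=2048, emb_dim=256, hid_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.img_proj = nn.Linear(img_dim, hid_dim)
        self.kg_proj = nn.Linear(emb_dim, hid_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feats, kg_term_ids, captions):
        # kg_term_ids: (B, K) ids of KG neighbors of detected objects,
        # e.g., "dog" -> {"animal", "pet"} in some commonsense graph.
        kg_ctx = self.word_emb(kg_term_ids).mean(dim=1)
        h0 = torch.tanh(self.img_proj(img_feats) + self.kg_proj(kg_ctx))
        state = (h0.unsqueeze(0), torch.zeros_like(h0).unsqueeze(0))
        hidden, _ = self.lstm(self.word_emb(captions), state)  # teacher forcing
        return self.out(hidden)  # (B, T, vocab) next-token logits

model = KGCaptioner()
logits = model(torch.randn(2, 2048),              # global image features
               torch.randint(0, 1000, (2, 3)),    # 3 KG terms per image
               torch.randint(0, 1000, (2, 5)))    # gold caption token ids
print(logits.shape)  # torch.Size([2, 5, 1000])

The same fusion could equally be applied per decoding step via attention; the
abstract does not specify the mechanism, so the sketch uses the simplest one.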
Text with Knowledge Graph Augmented Transformer for Video Captioning
Video captioning aims to describe the content of videos in natural language.
Although significant progress has been made, there is still much room to
improve performance for real-world applications, mainly due to the long-tail
words challenge. In this paper, we propose a text with knowledge graph
augmented transformer (TextKG) for video captioning. Notably, TextKG is a
two-stream transformer formed by an external stream and an internal stream.
The external stream is designed to absorb additional knowledge: it models the
interactions between that knowledge, e.g., a pre-built knowledge graph, and the
built-in information of videos, e.g., salient object regions, speech
transcripts, and video captions, to mitigate the long-tail words challenge.
Meanwhile, the internal stream is designed to exploit the multi-modality
information in videos (e.g., the appearance of video frames, speech
transcripts, and video captions) to ensure the quality of the generated
captions. In addition, a cross-attention mechanism is used between the two
streams to share information, so that the two streams help each other produce
more accurate results. Extensive experiments conducted on four challenging
video captioning datasets, i.e., YouCookII, ActivityNet Captions, MSRVTT, and
MSVD, demonstrate that the proposed method performs favorably against
state-of-the-art methods. Specifically, TextKG outperforms the best published
results by 18.7% absolute CIDEr score on the YouCookII dataset.
Comment: Accepted by CVPR202
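
A minimal sketch, under assumed dimensions and layer layout, of the two-stream
idea the abstract describes: an internal (video) stream and an external
(knowledge) stream are each self-attended, then exchange information through
cross-attention. This is an illustrative reconstruction, not the authors' code.

import torch
import torch.nn as nn

class TwoStreamLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.internal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.external = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cross_i2e = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_e2i = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vid, kg):
        # Self-attention within each stream.
        vid, kg = self.internal(vid), self.external(kg)
        # Cross-attention: each stream attends to the other's tokens,
        # kept as a residual so stream-specific information survives.
        vid = vid + self.cross_i2e(vid, kg, kg)[0]
        kg = kg + self.cross_e2i(kg, vid, vid)[0]
        return vid, kg

vid = torch.randn(2, 40, 256)  # frame/speech/caption tokens (internal stream)
kg = torch.randn(2, 12, 256)   # pre-built knowledge-graph tokens (external)
vid, kg = TwoStreamLayer()(vid, kg)
print(vid.shape, kg.shape)     # torch.Size([2, 40, 256]) torch.Size([2, 12, 256])

Stacking several such layers and decoding from the internal stream would give
one plausible end-to-end shape for a model of this kind.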
Adapting Visual Question Answering Models for Enhancing Multimodal Community Q&A Platforms
Question categorization and expert retrieval methods have been crucial for
information organization and accessibility on community question answering
(CQA) platforms. Research in this area, however, has dealt only with the text
modality. Given the increasingly multimodal nature of web content, we focus on
extending these methods to CQA questions accompanied by images. Specifically,
we leverage the success of representation learning for text and images in the
visual question answering (VQA) domain, and adapt the underlying concepts and
architecture for automated category classification and expert retrieval on
image-based questions posted on Yahoo! Chiebukuro, the Japanese counterpart of
Yahoo! Answers.
To the best of our knowledge, this is the first work to tackle the
multimodality challenge in CQA and to adapt VQA models to a more ecologically
valid source of visual questions. Our analysis of the differences between
visual QA and community QA data motivates two proposals: novel augmentations
of an attention method tailored for CQA, and auxiliary tasks for learning
better grounding features. Our final model markedly outperforms the text-only
and VQA-model baselines on both tasks, classification and expert retrieval,
using real-world multimodal CQA data.
Comment: Submitted for review at CIKM 201
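
For orientation, here is a minimal sketch of the generic VQA-style recipe such
work adapts: question-guided attention over image-region features, fused with
a question summary, feeding a category classifier. All module names,
dimensions, and the region-feature input are assumptions for illustration,
not the paper's architecture.

import torch
import torch.nn as nn

class MultimodalCQAClassifier(nn.Module):
    def __init__(self, vocab=5000, dim=300, region_dim=2048, n_cats=20):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.q_enc = nn.GRU(dim, dim, batch_first=True)
        self.region_proj = nn.Linear(region_dim, dim)
        self.att = nn.Linear(dim, 1)
        self.cls = nn.Linear(2 * dim, n_cats)

    def forward(self, q_ids, regions):
        # Question summary: final GRU hidden state.
        _, q = self.q_enc(self.emb(q_ids))
        q = q.squeeze(0)                                   # (B, dim)
        r = self.region_proj(regions)                      # (B, R, dim)
        # Question-guided attention over image regions.
        scores = self.att(torch.tanh(r + q.unsqueeze(1)))  # (B, R, 1)
        v = (scores.softmax(dim=1) * r).sum(dim=1)         # attended visual vec
        return self.cls(torch.cat([q, v], dim=-1))         # category logits

model = MultimodalCQAClassifier()
logits = model(torch.randint(0, 5000, (2, 12)),  # question token ids
               torch.randn(2, 36, 2048))         # 36 detected region features
print(logits.shape)  # torch.Size([2, 20])

Expert retrieval would reuse the same fused representation, scoring it against
expert embeddings instead of category logits.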
Poet: Product-oriented Video Captioner for E-commerce
In e-commerce, a growing number of user-generated videos are used for product
promotion. Generating video descriptions that narrate the user-preferred
product characteristics depicted in a video is vital for successful promotion.
Traditional video captioning methods, which focus on routinely describing what
exists and happens in a video, are not well suited to product-oriented video
captioning. To address this problem, we propose a product-oriented video
captioner framework, abbreviated as Poet. Poet first represents the videos as
product-oriented spatial-temporal graphs. Then, based on the aspects of the
video-associated product, it performs knowledge-enhanced spatial-temporal
inference on those graphs to capture the dynamic changes of fine-grained
product-part characteristics. The knowledge-leveraging module in Poet differs
from traditional designs by performing knowledge filtering and dynamic memory
modeling. We show that Poet achieves consistent improvements over previous
methods in generation quality, product-aspect capturing, and lexical
diversity. Experiments are performed on two product-oriented video captioning
datasets collected from Mobile Taobao: the buyer-generated fashion video
dataset (BFVD) and the fan-generated fashion video dataset (FFVD). We will
release the desensitized datasets to promote further investigation of both
video captioning and general video analysis problems.
Comment: 10 pages, 3 figures, to appear in ACM MM 2020 proceedings
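
A rough sketch, under assumed interfaces, of the two ideas the abstract
highlights: scoring (filtering) product-knowledge entries against per-step
video readouts, and folding the filtered context into a recurrent memory.
The module, its shapes, and the bilinear scorer are hypothetical stand-ins,
not Poet's actual design.

import torch
import torch.nn as nn

class KnowledgeFilterMemory(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Bilinear(dim, dim, 1)  # video-knowledge relevance
        self.memory = nn.GRUCell(dim, dim)

    def forward(self, video_steps, knowledge):
        # video_steps: (T, B, dim) per-step spatial-temporal graph readouts
        # knowledge:   (B, K, dim) embedded product-aspect knowledge entries
        B, K, D = knowledge.shape
        mem = knowledge.new_zeros(B, D)
        for v in video_steps:                                 # v: (B, dim)
            # Knowledge filtering: weight entries by relevance to this step.
            w = self.score(v.unsqueeze(1).expand(-1, K, -1), knowledge)
            k_ctx = (w.softmax(dim=1) * knowledge).sum(dim=1)  # (B, dim)
            # Dynamic memory: fold filtered knowledge into the running state.
            mem = self.memory(k_ctx, mem)
        return mem  # would condition the caption decoder

mod = KnowledgeFilterMemory()
out = mod(torch.randn(8, 2, 256), torch.randn(2, 5, 256))
print(out.shape)  # torch.Size([2, 256])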
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to
combine various modalities in a single joint representation. Especially in the
area of visiolinguistic (VL) learning, multiple models and techniques have
been developed, targeting a variety of tasks that involve images and text. VL
models have reached unprecedented performance by extending the idea of
Transformers so that both modalities can learn from each other. Massive
pre-training procedures enable VL models to acquire a certain level of
real-world understanding, although many gaps remain: their limited
comprehension of commonsense, factual, temporal, and other everyday knowledge
calls into question how far VL tasks can be extended. Knowledge graphs and
other knowledge sources can fill those gaps by explicitly providing the
missing information, unlocking novel capabilities of VL models. At the same
time, knowledge graphs enhance the explainability, fairness, and validity of
decision making, issues of utmost importance for such complex systems. This
survey aims to unify the fields of VL representation learning and knowledge
graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.