Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
Scene Graph Generation (SGG) offers a structured representation critical in
many computer vision applications. Traditional SGG approaches, however, are
limited by a closed-set assumption, restricting them to recognizing only
predefined object and relation categories. To overcome this, we categorize SGG
scenarios into four distinct settings based on the openness of node and edge categories: Closed-set
SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary
Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based
SGG (OvD+R-SGG). While object-centric open vocabulary SGG has been studied
recently, the more challenging problem of relation-involved open-vocabulary SGG
remains relatively unexplored. To fill this gap, we propose a unified framework
named OvSGTR towards fully open-vocabulary SGG from a holistic view. The
proposed framework is an end-to-end transformer architecture, which learns a
visual-concept alignment for both nodes and edges, enabling the model to
recognize unseen categories. For the more challenging settings of
relation-involved open vocabulary SGG, the proposed approach integrates
relation-aware pre-training utilizing image-caption data and retains
visual-concept alignment through knowledge distillation. Comprehensive
experimental results on the Visual Genome benchmark demonstrate the
effectiveness and superiority of the proposed framework.
Comment: 10 pages, 4 figures, 6 tables
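To make the visual-concept alignment idea concrete, here is a minimal sketch in PyTorch of classifying node or edge features by cosine similarity against text embeddings of category names, which is how an open-vocabulary head can score categories never seen during training. All names, shapes, and the temperature value are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(region_feats, text_embeds, temperature=0.07):
    """Score visual features against text embeddings of category names.

    region_feats: (N, D) visual features for nodes or edges.
    text_embeds:  (C, D) text-encoder embeddings of category names; unseen
                  categories can be supported by appending their embeddings.
    Returns (N, C) similarity logits.
    """
    v = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    return v @ t.t() / temperature

# Illustrative usage: node features from a transformer decoder, category names
# encoded with a frozen text encoder (shapes are made up for the example).
node_feats = torch.randn(8, 256)
class_embeds = torch.randn(50, 256)
logits = open_vocab_classify(node_feats, class_embeds)
```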
Move Forward and Tell: A Progressive Generator of Video Descriptions
We present an efficient framework that can generate a coherent paragraph to
describe a given video. Previous works on video captioning usually focus on
video clips. They typically treat an entire video as a whole and generate the
caption conditioned on a single embedding. In contrast, we consider videos
with rich temporal structures and aim to generate paragraph descriptions that
can preserve the story flow while being coherent and concise. Towards this
goal, we propose a new approach, which produces a descriptive paragraph by
assembling temporally localized descriptions. Given a video, it selects a
sequence of distinctive clips and generates sentences thereon in a coherent
manner. Particularly, the selection of clips and the production of sentences
are done jointly and progressively, driven by a recurrent network -- what to
describe next depends on what has been said before. Here, the recurrent
network is learned via self-critical sequence training with both sentence-level
and paragraph-level rewards. On the ActivityNet Captions dataset, our method
demonstrated the capability of generating high-quality paragraph descriptions
for videos. Compared to those by other methods, the descriptions produced by
our method are often more relevant, more coherent, and more concise.
Comment: Accepted by ECCV 2018
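The self-critical sequence training mentioned above can be sketched as a policy-gradient loss that uses the reward of a greedy-decoded output as its baseline; the paper applies it with both sentence-level and paragraph-level rewards. The sketch below shows only the generic SCST objective, with illustrative argument names, and is not the authors' implementation.

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical sequence training: policy gradient with a greedy baseline.

    sample_logprobs: (B,) summed log-probabilities of sampled outputs.
    sample_reward:   (B,) metric score (e.g. CIDEr) of the sampled outputs.
    greedy_reward:   (B,) same metric for greedy-decoded outputs (the baseline).
    """
    advantage = (sample_reward - greedy_reward).detach()
    return -(advantage * sample_logprobs).mean()
```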
Attention, Please! Adversarial Defense via Attention Rectification and Preservation
This study provides a new understanding of the adversarial attack problem by
examining the correlation between adversarial attack and visual attention
change. In particular, we observed that: (1) images with incomplete attention
regions are more vulnerable to adversarial attacks; and (2) successful
adversarial attacks lead to deviated and scattered attention maps. Accordingly,
an attention-based adversarial defense framework is designed to simultaneously
rectify the attention map for prediction and preserve the attention area
between adversarial and original images. We also discuss the problem of adding
iteratively attacked samples in the context of visual attention change.
We hope the attention-related data analysis and defense solution in this study
will shed some light on the mechanism behind the adversarial attack and also
facilitate future adversarial defense/attack model design.
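One plausible way to realize the preservation part of such a defense is to add an attention-consistency term to the usual classification loss, keeping the attention map of an adversarial image close to that of its clean counterpart. The sketch below is an assumption about how this could look (it omits the rectification term and uses illustrative names), not the paper's exact objective.

```python
import torch.nn.functional as F

def attention_defense_loss(logits, labels, attn_clean, attn_adv, lam=1.0):
    """Classification loss plus an attention-preservation term.

    attn_clean, attn_adv: (B, H, W) attention maps (e.g. CAM-style) of the
    clean and adversarial images; `lam` weights the preservation term.
    """
    cls = F.cross_entropy(logits, labels)
    preserve = F.l1_loss(attn_adv, attn_clean)  # keep attention areas aligned
    return cls + lam * preserve
```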
Video Storytelling: Textual Summaries for Events
Bridging vision and natural language is a longstanding goal in computer
vision and multimedia research. While earlier works focus on generating a
single-sentence description for visual content, recent works have studied
paragraph generation. In this work, we introduce the problem of video
storytelling, which aims at generating coherent and succinct stories for long
videos. Video storytelling introduces new challenges, mainly due to the
diversity of the story and the length and complexity of the video. We propose
novel methods to address the challenges. First, we propose a context-aware
framework for multimodal embedding learning, where we design a Residual
Bidirectional Recurrent Neural Network to leverage contextual information from
past and future. Second, we propose a Narrator model to discover the underlying
storyline. The Narrator is formulated as a reinforcement learning agent which
is trained by directly optimizing the textual metric of the generated story. We
evaluate our method on the Video Story dataset, a new dataset that we have
collected to enable the study. We compare our method with multiple
state-of-the-art baselines, and show that our method achieves better
performance in terms of both quantitative measures and a user study.
Comment: Published in IEEE Transactions on Multimedia
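A Residual Bidirectional Recurrent Neural Network of the kind described can be sketched as a bidirectional GRU whose contextual output is projected and added back to the per-clip input features, so each clip embedding keeps its own content while absorbing context from the past and the future. Layer sizes and names below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResidualBiGRU(nn.Module):
    """Bidirectional GRU with a residual connection to the input features."""

    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, clip_feats):          # (B, T, feat_dim) clip features
        ctx, _ = self.rnn(clip_feats)       # (B, T, 2 * hidden) context
        return clip_feats + self.proj(ctx)  # residual combination
```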
Scene Graph Generation with External Knowledge and Image Reconstruction
Scene graph generation has received growing attention with the advancements
in image understanding tasks such as object detection, attribute and
relationship prediction, etc. However, existing datasets are biased in terms
of object and relationship labels, or often come with noisy and missing
annotations, which makes the development of a reliable scene graph prediction
model very challenging. In this paper, we propose a novel scene graph
generation algorithm with external knowledge and image reconstruction loss to
overcome these dataset issues. In particular, we extract commonsense knowledge
from the external knowledge base to refine object and phrase features for
improving generalizability in scene graph generation. To address the bias of
noisy object annotations, we introduce an auxiliary image reconstruction path
to regularize the scene graph generation network. Extensive experiments show
that our framework can generate better scene graphs, achieving
state-of-the-art performance on two benchmark datasets: Visual Relationship
Detection and Visual Genome.
Comment: 10 pages, 5 figures, Accepted in CVPR 2019
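One simple way to refine object or phrase features with retrieved commonsense knowledge is an attention-weighted fusion of the knowledge embeddings, as sketched below. The retrieval source, the fusion rule, and all names are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def refine_with_knowledge(obj_feats, kb_embeds):
    """Fuse visual features with retrieved knowledge-base embeddings.

    obj_feats: (N, D) object/phrase features.
    kb_embeds: (N, K, D) embeddings of K knowledge facts retrieved per object
               (e.g. from an external commonsense knowledge base).
    """
    scores = torch.einsum('nd,nkd->nk', obj_feats, kb_embeds)
    attn = F.softmax(scores, dim=-1)
    knowledge = torch.einsum('nk,nkd->nd', attn, kb_embeds)
    return obj_feats + knowledge  # residual refinement of the visual feature
```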