DCA: Diversified Co-Attention towards Informative Live Video Commenting
We focus on the task of Automatic Live Video Commenting (ALVC), which aims to
generate real-time video comments with both video frames and other viewers'
comments as inputs. A major challenge in this task is how to properly leverage
the rich and diverse information carried by video and text. In this paper, we
aim to collect diversified information from video and text for informative
comment generation. To this end, we propose a Diversified Co-Attention (DCA)
model. Our model builds bidirectional interactions between
video frames and surrounding comments from multiple perspectives via metric
learning, to collect a diversified and informative context for comment
generation. We also propose an effective parameter orthogonalization technique
to avoid excessive overlap of information learned from different perspectives.
Results show that our approach outperforms existing methods in the ALVC task,
achieving new state-of-the-art results.
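To make the multi-perspective idea concrete, here is a minimal PyTorch sketch of one plausible reading: each perspective owns a learned bilinear metric that scores video-text pairs, and an orthogonality penalty on those metrics discourages the perspectives from collapsing onto the same information. The class name, tensor shapes, and the Frobenius-style penalty are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiversifiedCoAttention(nn.Module):
    """Multi-perspective co-attention between video and text features (sketch)."""

    def __init__(self, dim: int, num_perspectives: int = 4):
        super().__init__()
        # One learned bilinear metric per perspective (the metric-learning view).
        self.metrics = nn.Parameter(0.02 * torch.randn(num_perspectives, dim, dim))

    def forward(self, video: torch.Tensor, text: torch.Tensor):
        # video: (B, Nv, D) frame features; text: (B, Nt, D) comment features.
        ctx_v, ctx_t = [], []
        for W in self.metrics:
            sim = video @ W @ text.transpose(1, 2)            # (B, Nv, Nt)
            ctx_v.append(F.softmax(sim, dim=-1) @ text)       # text -> video context
            ctx_t.append(F.softmax(sim.transpose(1, 2), dim=-1) @ video)
        # Concatenating the per-perspective contexts yields a diversified context.
        return torch.cat(ctx_v, dim=-1), torch.cat(ctx_t, dim=-1)

    def orthogonality_penalty(self) -> torch.Tensor:
        # Discourage overlap between perspectives: normalize each flattened
        # metric and push the Gram matrix of the set toward the identity.
        flat = F.normalize(self.metrics.flatten(1), dim=1)    # (K, D*D)
        gram = flat @ flat.t()
        eye = torch.eye(gram.size(0), device=gram.device)
        return ((gram - eye) ** 2).sum()
```

In training, the penalty would be added to the generation loss with a small weight, which is one simple way to realize "parameter orthogonalization" across perspectives.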
LCCo: Lending CLIP to Co-Segmentation
This paper studies co-segmenting the common semantic object in a set of
images. Existing works either rely on carefully engineered networks to mine the
implicit semantic information in visual features or require extra data (i.e.,
classification labels) for training. In this paper, we leverage the contrastive
language-image pre-training framework (CLIP) for the task. With a backbone
segmentation network that independently processes each image from the set, we
introduce semantics from CLIP into the backbone features, refining them in a
coarse-to-fine manner with three key modules: i) an image set feature
correspondence module, encoding global consistent semantic information of the
image set; ii) a CLIP interaction module, using CLIP-mined common semantics of
the image set to refine the backbone feature; iii) a CLIP regularization
module, drawing CLIP towards this co-segmentation task, identifying the best
CLIP semantic and using it to regularize the backbone feature. Experiments on
four standard co-segmentation benchmark datasets show that our method
outperforms state-of-the-art methods.
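As a rough illustration of the CLIP interaction idea, the sketch below pools CLIP image embeddings over the image set to obtain a common semantic vector and uses it to re-weight the channels of each image's backbone features. The channel-gating fusion and all names here are assumptions for illustration; the paper's three modules are more elaborate than this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPInteraction(nn.Module):
    """Refine per-image backbone features with set-level CLIP semantics (sketch)."""

    def __init__(self, feat_dim: int, clip_dim: int = 512):
        super().__init__()
        self.to_gate = nn.Linear(clip_dim, feat_dim)

    def forward(self, feats: torch.Tensor, clip_embs: torch.Tensor):
        # feats: (N, C, H, W) backbone features for the N images in the set.
        # clip_embs: (N, clip_dim) CLIP image embeddings for the same images,
        # e.g. obtained from a CLIP image encoder.
        # Pool the embeddings to mine the semantics common to the set.
        common = F.normalize(clip_embs.mean(dim=0, keepdim=True), dim=-1)  # (1, clip_dim)
        gate = torch.sigmoid(self.to_gate(common))                         # (1, C)
        # Re-weight every image's channels toward the set-common semantics.
        return feats * gate.view(1, -1, 1, 1)
```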
Deep Semantic Matching with Foreground Detection and Cycle-Consistency
Establishing dense semantic correspondences between object instances remains
a challenging problem due to background clutter, significant scale and pose
differences, and large intra-class variations. In this paper, we address weakly
supervised semantic matching based on a deep network where only image pairs
without manual keypoint correspondence annotations are provided. To facilitate
network training with this weaker form of supervision, we 1) explicitly
estimate the foreground regions to suppress the effect of background clutter
and 2) develop cycle-consistent losses to enforce the predicted transformations
across multiple images to be geometrically plausible and consistent. We train
the proposed model using the PF-PASCAL dataset and evaluate the performance on
the PF-PASCAL, PF-WILLOW, and TSS datasets. Extensive experimental results show
that the proposed approach performs favorably against the state-of-the-art
methods.
Comment: ACCV 2018. PAMI 2020 extension: arXiv:1906.0585
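The cycle-consistency constraint lends itself to a compact sketch: composing the predicted transformations around an image cycle should return (approximately) the identity. Representing each transformation as a 3x3 homogeneous matrix is an assumption made here for illustration; the paper's exact parameterization of the geometric transformations may differ.

```python
import torch

def cycle_consistency_loss(t_ab: torch.Tensor,
                           t_bc: torch.Tensor,
                           t_ca: torch.Tensor) -> torch.Tensor:
    # t_ab, t_bc, t_ca: (B, 3, 3) homogeneous transforms for A->B, B->C, C->A.
    # Going around the cycle A -> B -> C -> A should be (near) the identity,
    # which makes the predicted transformations geometrically consistent.
    cycle = t_ca @ t_bc @ t_ab
    eye = torch.eye(3, device=cycle.device).expand_as(cycle)
    return ((cycle - eye) ** 2).mean()
```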
Unsupervised and semi-supervised co-salient object detection via segmentation frequency statistics
In this paper, we address the detection of co-occurring salient objects
(CoSOD) in an image group using frequency statistics in an unsupervised
manner, which further enables us to develop a semi-supervised method. While
previous works have mostly focused on fully supervised CoSOD, less attention
has been paid to detecting co-salient objects when only limited segmentation
annotations are available for training. Our simple yet effective unsupervised
method
US-CoSOD combines the object co-occurrence frequency statistics of unsupervised
single-image semantic segmentations with salient foreground detections using
self-supervised feature learning. For the first time, we show that a large
unlabeled dataset, e.g., ImageNet-1k, can be effectively leveraged to
significantly improve unsupervised CoSOD performance. Our unsupervised model is
a strong pre-training initialization for our semi-supervised model SS-CoSOD,
especially when very limited labeled data is available for training. To avoid
propagating erroneous signals from predictions on unlabeled data, we propose a
confidence estimation module to guide our semi-supervised training. Extensive
experiments on three CoSOD benchmark datasets show that both of our
unsupervised and semi-supervised models outperform the corresponding
state-of-the-art models by a significant margin (e.g., on the Cosal2015
dataset, our US-CoSOD model has an 8.8% F-measure gain over a SOTA unsupervised
co-segmentation model and our SS-CoSOD model has an 11.81% F-measure gain over
a SOTA semi-supervised CoSOD model).
Comment: Accepted at IEEE WACV 202
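To sketch how segmentation frequency statistics can surface co-salient objects, the toy NumPy function below counts, for each unsupervised segment cluster, how often it overlaps salient foreground across the image group, then scores each image's salient pixels by that frequency. The inputs (per-pixel cluster ids from unsupervised segmentation plus a saliency map per image) and all names are illustrative assumptions; the actual pipeline is considerably richer.

```python
import numpy as np

def co_salient_masks(cluster_maps, saliency_maps, num_clusters, thresh=0.5):
    # cluster_maps: list of (H, W) int arrays of per-pixel cluster ids.
    # saliency_maps: list of (H, W) float arrays in [0, 1].
    # Count, per cluster, the fraction of images where it overlaps
    # salient foreground: the co-occurrence frequency statistic.
    freq = np.zeros(num_clusters)
    for cmap, smap in zip(cluster_maps, saliency_maps):
        salient = smap > thresh
        for c in np.unique(cmap[salient]):
            freq[c] += 1
    freq /= len(cluster_maps)
    # A segment is scored as co-salient when its cluster recurs across
    # most images; frequent clusters keep their salient pixels.
    return [(freq[cmap] * (smap > thresh)).astype(np.float32)
            for cmap, smap in zip(cluster_maps, saliency_maps)]
```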