Per-Clip Video Object Segmentation
Recently, memory-based approaches have shown promising results on semi-supervised
video object segmentation. These methods predict object masks frame by frame,
aided by a frequently updated memory of previous masks. In contrast to
this per-frame inference, we investigate an alternative perspective by treating
video object segmentation as clip-wise mask propagation. In this per-clip
inference scheme, we update the memory at an interval and simultaneously
process the set of consecutive frames (i.e., a clip) between memory updates. The
scheme provides two potential benefits: accuracy gain by clip-level
optimization and efficiency gain by parallel computation of multiple frames. To
this end, we propose a new method tailored for per-clip inference.
Specifically, we first introduce a clip-wise operation to refine the features
based on intra-clip correlation. In addition, we employ a progressive matching
mechanism for efficient information-passing within a clip. With the synergy of
two modules and a newly proposed per-clip based training, our network achieves
state-of-the-art performance on YouTube-VOS 2018/2019 val (84.6% and 84.6%) and
DAVIS 2016/2017 val (91.9% and 86.1%). Furthermore, our model offers a
favorable speed-accuracy trade-off across varying memory update intervals,
providing great flexibility.
Comment: CVPR 2022; Code is available at https://github.com/pkyong95/PCVO
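A minimal sketch of the per-clip inference loop described above; the model interface (`encode_memory`, `segment_clip`, `update_memory`) and the default clip length are hypothetical placeholders, not the authors' actual API:

```python
import torch

def per_clip_inference(frames, first_mask, model, clip_len=5):
    """Sketch of per-clip VOS inference: refresh the memory once per clip
    and segment all frames of a clip jointly (hypothetical interface)."""
    memory = model.encode_memory(frames[0], first_mask)  # memory from the annotated first frame
    masks = [first_mask]
    for start in range(1, len(frames), clip_len):
        clip = frames[start:start + clip_len]   # consecutive frames forming one clip
        clip_batch = torch.stack(clip)          # the whole clip is processed in parallel
        clip_masks = model.segment_clip(clip_batch, memory)
        masks.extend(clip_masks.unbind(0))
        # memory is updated once per clip rather than once per frame,
        # which is the source of the speed-accuracy trade-off noted above
        memory = model.update_memory(memory, clip[-1], clip_masks[-1])
    return masks
```

Longer clip lengths update the memory less often (faster, potentially less accurate), while `clip_len=1` recovers conventional per-frame inference.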
Second-order Temporal Pooling for Action Recognition
Deep learning models for video-based action recognition usually generate
features for short clips (consisting of a few frames); such clip-level features
are aggregated into video-level representations by computing statistics on these
features. Typically, zeroth-order (max) or first-order (average) statistics are
used. In this paper, we explore the benefits of using second-order statistics.
Specifically, we propose a novel end-to-end learnable feature aggregation
scheme, dubbed temporal correlation pooling, which generates an action descriptor
for a video sequence by capturing the similarities between the temporal
evolution of clip-level CNN features computed across the video. Such a
descriptor, while being computationally cheap, also naturally encodes the
co-activations of multiple CNN features, thereby providing a richer
characterization of actions than their first-order counterparts. We also
propose higher-order extensions of this scheme by computing correlations after
embedding the CNN features in a reproducing kernel Hilbert space. We provide
experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained
datasets such as MPII Cooking activities and JHMDB, as well as the recent
Kinetics-600. Our results demonstrate the advantages of higher-order pooling
schemes, which, when combined with hand-crafted features (as is standard
practice), achieve state-of-the-art accuracy.
Comment: Accepted in the International Journal of Computer Vision (IJCV)
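As a concrete illustration of second-order aggregation, here is a simplified stand-in for such a descriptor (not the paper's exact temporal correlation pooling, which is learned end-to-end):

```python
import numpy as np

def second_order_pooling(clip_features):
    """Aggregate (T, D) clip-level features into one video-level
    descriptor built from their second-order statistics."""
    X = clip_features - clip_features.mean(axis=0, keepdims=True)  # center over time
    C = X.T @ X / max(len(X) - 1, 1)   # (D, D) co-activation matrix
    iu = np.triu_indices(C.shape[0])   # the matrix is symmetric, so keep
    return C[iu]                       # only the upper triangle as a vector

# e.g. a video split into 10 clips with 512-dim CNN features each
descriptor = second_order_pooling(np.random.randn(10, 512))
```

Unlike max or average pooling, this retains pairwise co-activations of feature dimensions, which is what makes second-order descriptors a richer characterization of actions.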
CLIP/CETL Fellowship Report 2007/8: Mentorship Scheme: extending work-related learning
This project seeks to work with four parties (current second-year students, alumni within the design industry, design professionals, and PPD staff at LCC) in a two-stage process: setting up the requirements for a mentorship scheme and then investigating the outcomes. A handbook and a website were produced.
ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a
two-stage scheme. The general idea is to first generate class-agnostic region
proposals and then feed the cropped proposal regions to CLIP to utilize its
image-level zero-shot classification capability. While effective, such a scheme
requires two image encoders, one for proposal generation and one for CLIP,
leading to a complicated pipeline and high computational cost. In this work, we
pursue a simpler and more efficient one-stage solution that directly extends CLIP's
zero-shot prediction capability from image to pixel level. Our investigation
starts with a straightforward extension as our baseline that generates semantic
masks by computing the similarity between text and patch embeddings extracted
from CLIP. However, such a paradigm could heavily overfit the seen classes and
fail to generalize to unseen classes. To handle this issue, we propose three
simple yet effective designs and find that they can significantly retain
the inherent zero-shot capacity of CLIP and improve pixel-level generalization
ability. Incorporating those modifications leads to an efficient zero-shot
semantic segmentation system called ZegCLIP. Through extensive experiments on
three public benchmarks, ZegCLIP demonstrates superior performance,
outperforming the state-of-the-art methods by a large margin under both
"inductive" and "transductive" zero-shot settings. In addition, compared with
the two-stage method, our one-stage ZegCLIP runs about 5 times faster during
inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
Comment: 12 pages, 8 figures
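A minimal sketch of the patch-text baseline described above; the function name, tensor shapes, and the argmax decoding are illustrative assumptions, not ZegCLIP's actual implementation:

```python
import torch
import torch.nn.functional as F

def patch_text_masks(patch_emb, text_emb, grid_hw):
    """Label every CLIP image patch with its most similar class-name
    embedding, yielding a coarse semantic mask.

    patch_emb: (N, D) patch embeddings from CLIP's image encoder
    text_emb:  (K, D) class embeddings from CLIP's text encoder
    grid_hw:   (H, W) patch grid, with H * W == N
    """
    p = F.normalize(patch_emb, dim=-1)  # unit-norm patch embeddings
    t = F.normalize(text_emb, dim=-1)   # unit-norm text embeddings
    sim = p @ t.t()                     # (N, K) cosine similarities
    return sim.argmax(dim=-1).view(*grid_hw)  # per-patch class indices

# e.g. a 14x14 patch grid, 512-dim CLIP space, 5 candidate classes
mask = patch_text_masks(torch.randn(196, 512), torch.randn(5, 512), (14, 14))
```

As noted above, such a direct extension tends to overfit the seen classes; the three proposed designs are what restore CLIP's zero-shot capacity at the pixel level.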
Low-latency compression of mocap data using learned spatial decorrelation transform
Due to the growing use of human motion capture (mocap) in movies, video
games, sports, etc., it is highly desirable to compress mocap data for efficient
storage and transmission. This paper presents two efficient frameworks for
compressing human mocap data with low latency. The first framework processes
the data frame by frame, making it ideal for mocap data
streaming and time-critical applications. The second is clip-based and
provides a flexible tradeoff between latency and compression performance. Since
mocap data exhibits some unique spatial characteristics, we propose a very
effective transform, namely the learned orthogonal transform (LOT), for reducing
spatial redundancy. The LOT problem is formulated as minimizing squared
reconstruction error regularized by orthogonality and sparsity, and is solved via
alternating iteration. We also adopt predictive coding and a temporal DCT for temporal
decorrelation in the frame- and clip-based frameworks, respectively.
Experimental results show that the proposed frameworks can produce higher
compression performance at lower computational cost and latency than the
state-of-the-art methods.
Comment: 15 pages, 9 figures
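One plausible way to write the LOT objective out explicitly; the symbols and the choice of an l1 penalty for sparsity are illustrative assumptions rather than the paper's exact formulation:

```latex
\min_{T,\,C} \; \lVert X - T C \rVert_F^2 + \lambda \lVert C \rVert_1
\quad \text{s.t.} \quad T^\top T = I
```

Here X stacks the mocap frames, T is the learned orthogonal transform, C holds the sparse transform coefficients, and lambda trades reconstruction error against sparsity. Alternating iteration would then update T with C fixed (an orthogonal Procrustes-type step) and C with T fixed (a soft-thresholding step).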