60 research outputs found
Less than Few: Self-Shot Video Instance Segmentation
The goal of this paper is to bypass the need for labelled examples in
few-shot video understanding at run time. While proven effective, in many
practical video settings even labelling a few examples appears unrealistic.
This is especially true as the level of details in spatio-temporal video
understanding and with it, the complexity of annotations continues to increase.
Rather than performing few-shot learning with a human oracle to provide a few
densely labelled support videos, we propose to automatically learn to find
appropriate support videos given a query. We call this self-shot learning and
we outline a simple self-supervised learning method to generate an embedding
space well-suited for unsupervised retrieval of relevant samples. To showcase
this novel setting, we tackle, for the first time, video instance segmentation
in a self-shot (and few-shot) setting, where the goal is to segment instances
at the pixel-level across the spatial and temporal domains. We provide strong
baseline performances that utilize a novel transformer-based model and show
that self-shot learning can even surpass few-shot and can be positively
combined for further performance gains. Experiments on new benchmarks show that
our approach achieves strong performance, is competitive to oracle support in
some settings, scales to large unlabelled video collections, and can be
combined in a semi-supervised setting.Comment: 25 pages, 5 figures, 13 table
Learning Human Action Recognition Representations Without Real Humans
Pre-training on massive video datasets has become essential to achieve high
action recognition performance on smaller downstream datasets. However, most
large-scale video datasets contain images of people and hence are accompanied
with issues related to privacy, ethics, and data protection, often preventing
them from being publicly shared for reproducible research. Existing work has
attempted to alleviate these problems by blurring faces, downsampling videos,
or training on synthetic data. On the other hand, analysis on the
transferability of privacy-preserving pre-trained models to downstream tasks
has been limited. In this work, we study this problem by first asking the
question: can we pre-train models for human action recognition with data that
does not include real humans? To this end, we present, for the first time, a
benchmark that leverages real-world videos with humans removed and synthetic
data containing virtual humans to pre-train a model. We then evaluate the
transferability of the representation learned on this data to a diverse set of
downstream action recognition benchmarks. Furthermore, we propose a novel
pre-training strategy, called Privacy-Preserving MAE-Align, to effectively
combine synthetic data and human-removed real data. Our approach outperforms
previous baselines by up to 5% and closes the performance gap between human and
no-human action recognition representations on downstream tasks, for both
linear probing and fine-tuning. Our benchmark, code, and models are available
at https://github.com/howardzh01/PPMA .Comment: 19 pages, 7 figures, 2023 NeurIPS Datasets and Benchmarks Trac
Image-set, Temporal and Spatiotemporal Representations of Videos for Recognizing, Localizing and Quantifying Actions
This dissertation addresses the problem of learning video representations, which is defined here as transforming the video so that its essential structure is made more visible or accessible for action recognition and quantification. In the literature, a video can be represented by a set of images, by modeling motion or temporal dynamics, and by a 3D graph with pixels as nodes. This dissertation contributes in proposing a set of models to localize, track, segment, recognize and assess actions such as (1) image-set models via aggregating subset features given by regularizing normalized CNNs, (2) image-set models via inter-frame principal recovery and sparsely coding residual actions, (3) temporally local models with spatially global motion estimated by robust feature matching and local motion estimated by action detection with motion model added, (4) spatiotemporal models 3D graph and 3D CNN to model time as a space dimension, (5) supervised hashing by jointly learning embedding and quantization, respectively. State-of-the-art performances are achieved for tasks such as quantifying facial pain and human diving. Primary conclusions of this dissertation are categorized as follows: (i) Image set can capture facial actions that are about collective representation; (ii) Sparse and low-rank representations can have the expression, identity and pose cues untangled and can be learned via an image-set model and also a linear model; (iii) Norm is related with recognizability; similarity metrics and loss functions matter; (v) Combining the MIL based boosting tracker with the Particle Filter motion model induces a good trade-off between the appearance similarity and motion consistence; (iv) Segmenting object locally makes it amenable to assign shape priors; it is feasible to learn knowledge such as shape priors online from Web data with weak supervision; (v) It works locally in both space and time to represent videos as 3D graphs; 3D CNNs work effectively when inputted with temporally meaningful clips; (vi) the rich labeled images or videos help to learn better hash functions after learning binary embedded codes than the random projections. In addition, models proposed for videos can be adapted to other sequential images such as volumetric medical images which are not included in this dissertation
Simple and Complex Human Action Recognition in Constrained and Unconstrained Videos
Human action recognition plays a crucial role in visual learning applications such as video understanding and surveillance, video retrieval, human-computer interactions, and autonomous driving systems. A variety of methodologies have been proposed for human action recognition via developing of low-level features along with the bag-of-visual-word models. However, much less research has been performed on the compound of pre-processing, encoding and classification stages. This dissertation focuses on enhancing the action recognition performances via ensemble learning, hybrid classifier, hierarchical feature representation, and key action perception methodologies. Action variation is one of the crucial challenges in video analysis and action recognition. We address this problem by proposing the hybrid classifier (HC) to discriminate actions which contain similar forms of motion features such as walking, running, and jogging. Aside from that, we show and proof that the fusion of various appearance-based and motion features can boost the simple and complex action recognition performance. The next part of the dissertation introduces pooled-feature representation (PFR) which is derived from a double phase encoding framework (DPE). Considering that a given unconstrained video is composed of a sequence of simple frames, the first phase of DPE generates temporal sub-volumes from the video and represents them individually by employing the proposed improved rank pooling (IRP) method. The second phase constructs the pool of features by fusing the represented vectors from the first phase. The pool is compressed and then encoded to provide video-parts vector (VPV). The DPE framework allows distilling the video representation and hierarchically extracting new information. Compared with recent video encoding approaches, VPV can preserve the higher-level information through standard encoding of low-level features in two phases. Furthermore, the encoded vectors from both phases of DPE are fused along with a compression stage to develop PFR
- …