Video Representation Learning by Dense Predictive Coding
The objective of this paper is self-supervised learning of spatio-temporal
embeddings from video, suitable for human action recognition. We make three
contributions: First, we introduce the Dense Predictive Coding (DPC) framework
for self-supervised representation learning on videos. This learns a dense
encoding of spatio-temporal blocks by recurrently predicting future
representations; Second, we propose a curriculum training scheme to predict
further into the future with progressively less temporal context. This
encourages the model to encode only slowly varying spatio-temporal signals,
thereby leading to semantic representations; Third, we evaluate the approach
by first training the DPC model on the Kinetics-400 dataset with
self-supervised learning, and then finetuning the representation on a
downstream task, i.e., action recognition. With a single stream (RGB only), DPC-pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 (75.7% top-1 accuracy) and HMDB51 (35.7% top-1 accuracy), outperforming all previous learning methods by a significant margin and approaching the performance of a baseline pre-trained on ImageNet.
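The mechanism described above, recurrently predicting future spatio-temporal embeddings and scoring them contrastively, can be sketched roughly as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it uses a toy 3D-CNN block encoder, a plain GRU over globally pooled block features, and an InfoNCE-style loss, whereas the actual DPC predicts densely over spatial feature-map cells with a ConvGRU.

```python
# Minimal sketch of the Dense Predictive Coding idea (illustrative assumptions,
# not the authors' code): encode consecutive spatio-temporal blocks, summarize
# the past with a GRU, recurrently predict future block embeddings, and score
# them against the true futures with an InfoNCE-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCSketch(nn.Module):
    def __init__(self, feat_dim=256, pred_steps=3):
        super().__init__()
        self.pred_steps = pred_steps
        # toy 3D-CNN block encoder (a stand-in for the real backbone)
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(64, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.AdaptiveAvgPool3d(1))
        self.aggregator = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.predictor = nn.Linear(feat_dim, feat_dim)

    def forward(self, blocks):
        # blocks: (B, N, C, T, H, W) -- N consecutive spatio-temporal blocks
        B, N = blocks.shape[:2]
        z = self.encoder(blocks.flatten(0, 1)).flatten(1).view(B, N, -1)  # (B, N, D)
        n_ctx = N - self.pred_steps
        _, hidden = self.aggregator(z[:, :n_ctx])        # summarize the observed past
        preds = []
        for _ in range(self.pred_steps):                 # recurrent future prediction
            pred = self.predictor(hidden[-1])
            preds.append(pred)
            _, hidden = self.aggregator(pred.unsqueeze(1), hidden)  # feed prediction back
        preds = torch.stack(preds, dim=1)                # (B, pred_steps, D)
        targets = z[:, n_ctx:]                           # true future embeddings
        # InfoNCE: each prediction must identify its own future block among all others
        p = F.normalize(preds, dim=-1).flatten(0, 1)
        t = F.normalize(targets, dim=-1).flatten(0, 1)
        logits = p @ t.t() / 0.07
        return F.cross_entropy(logits, torch.arange(p.size(0), device=p.device))

# usage: 8 blocks of 5 RGB frames each; the last 3 blocks are the prediction targets
loss = DPCSketch()(torch.randn(2, 8, 3, 5, 64, 64))
loss.backward()
```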
Memory-augmented Dense Predictive Coding for Video Representation Learning
The objective of this paper is self-supervised learning from video, in
particular for representations for action recognition. We make the following
contributions: (i) We propose a new architecture and learning framework
Memory-augmented Dense Predictive Coding (MemDPC) for the task. It is trained
with a predictive attention mechanism over the set of compressed memories, such
that any future states can always be constructed by a convex combination of the
condensed representations, allowing multiple hypotheses to be made efficiently.
(ii) We investigate visual-only self-supervised video representation learning
from RGB frames, or from unsupervised optical flow, or both. (iii) We
thoroughly evaluate the quality of learnt representation on four different
downstream tasks: action recognition, video retrieval, learning with scarce
annotations, and unintentional action classification. In all cases, we
demonstrate state-of-the-art or comparable performance over other approaches
with orders of magnitude fewer training data.
Comment: ECCV 2020, Spotlight
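A rough sketch of the central idea, predicting the future as a convex combination over a learnable compressed memory bank, is given below. The slot count, dimensions and contrastive pairing are illustrative assumptions and not the authors' code; the full model is considerably richer, operating over spatial feature maps with a recurrent aggregator.

```python
# Hedged sketch of the memory-augmented prediction step: a query derived from
# the past context produces softmax attention over a learnable memory bank, so
# every predicted future state is a convex combination of compressed memory slots.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryPredictor(nn.Module):
    def __init__(self, feat_dim=256, num_slots=1024):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, feat_dim) * 0.01)
        self.to_logits = nn.Linear(feat_dim, num_slots)    # context -> attention logits

    def forward(self, context):
        # context: (B, D) summary of the observed past (e.g. from a recurrent aggregator)
        attn = F.softmax(self.to_logits(context), dim=-1)  # (B, K) convex weights
        return attn @ self.memory, attn                    # (B, D) predicted future state

# usage: contrast the memory-constrained prediction against the true future embedding
predictor = MemoryPredictor()
context, future_true = torch.randn(4, 256), torch.randn(4, 256)
pred, attn = predictor(context)
logits = F.normalize(pred, dim=-1) @ F.normalize(future_true, dim=-1).t() / 0.07
loss = F.cross_entropy(logits, torch.arange(4))
```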
Going Deeper into Action Recognition: A Survey
Understanding human actions in visual data is tied to advances in
complementary research areas including object recognition, human dynamics,
domain adaptation and semantic segmentation. Over the last decade, human action
analysis evolved from earlier schemes that are often limited to controlled
environments to nowadays advanced solutions that can learn from millions of
videos and apply to almost all daily activities. Given the broad range of
applications from video surveillance to human-computer interaction, scientific
milestones in action recognition are being reached at an increasing pace, quickly rendering once-competitive methods obsolete. This motivated us to
provide a comprehensive review of the notable steps taken towards recognizing
human actions. To this end, we start our discussion with the pioneering methods
that use handcrafted representations, and then, navigate into the realm of deep
learning based approaches. We aim to remain objective throughout this survey,
touching upon encouraging improvements as well as inevitable setbacks, in the hope of raising fresh questions and motivating new research directions for the reader.
Lip2AudSpec: Speech reconstruction from silent lip movements video
In this study, we propose a deep neural network for reconstructing
intelligible speech from silent lip-movement videos. We use the auditory spectrogram as the spectral representation of speech, together with its corresponding sound-generation method, which results in more natural-sounding reconstructed speech.
Our proposed network consists of an autoencoder that extracts bottleneck features from the auditory spectrogram, which are then used as the target for our main lip-reading network comprising CNN, LSTM and fully connected layers. Our
experiments show that the autoencoder is able to reconstruct the original
auditory spectrogram with a 98% correlation and also improves the quality of
reconstructed speech from the main lip-reading network. Our model, trained jointly on different speakers, is able to extract individual speaker characteristics and gives promising results in reconstructing intelligible speech with superior word-recognition accuracy.
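The two-stage design described above can be illustrated roughly as follows: an autoencoder compresses the auditory spectrogram into bottleneck features, and a CNN+LSTM lip-reading network regresses those bottlenecks from the video frames. All layer sizes and the single-channel mouth-crop input format are assumptions for the sketch, not the paper's exact architecture.

```python
# Illustrative sketch (not the authors' code) of the Lip2AudSpec-style setup:
# stage 1 learns bottleneck features of the auditory spectrogram with an
# autoencoder; stage 2 trains a CNN+LSTM lip-reading network to predict
# those bottleneck features from silent video frames.
import torch
import torch.nn as nn

class SpecAutoencoder(nn.Module):
    def __init__(self, spec_dim=128, bottleneck=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(spec_dim, 64), nn.ReLU(), nn.Linear(64, bottleneck))
        self.dec = nn.Sequential(nn.Linear(bottleneck, 64), nn.ReLU(), nn.Linear(64, spec_dim))

    def forward(self, spec):                  # spec: (B, T, spec_dim)
        z = self.enc(spec)                    # bottleneck features, the lip-net target
        return self.dec(z), z

class LipReader(nn.Module):
    def __init__(self, bottleneck=32, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(             # per-frame mouth-region encoder
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, bottleneck)

    def forward(self, frames):                # frames: (B, T, 1, H, W)
        B, T = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(f)
        return self.fc(out)                   # predicted bottleneck features per frame

# training sketch: MSE between predicted and autoencoder bottleneck features
ae, lipnet = SpecAutoencoder(), LipReader()
spec, frames = torch.randn(2, 50, 128), torch.randn(2, 50, 1, 48, 48)
recon, target = ae(spec)
loss = nn.functional.mse_loss(lipnet(frames), target.detach())
```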
Multi-View Frame Reconstruction with Conditional GAN
Multi-view frame reconstruction is an important problem, particularly when
multiple frames are missing and past and future frames within the camera are
far apart from the missing ones. Realistic coherent frames can still be
reconstructed using corresponding frames from other overlapping cameras. We
propose an adversarial approach to learn the spatio-temporal representation of
the missing frame using conditional Generative Adversarial Network (cGAN). The
conditional input to each cGAN is the preceding or following frames within the
camera or the corresponding frames in other overlapping cameras, all of which
are merged together using a weighted average. Representations learned from frames within the same camera are weighted more heavily than those learned from other cameras when they are temporally close to the missing frames, and vice versa.
Experiments on two challenging datasets demonstrate that our framework produces results comparable to the state-of-the-art reconstruction method in a single camera and achieves promising performance in the multi-camera scenario.
Comment: 5 pages, 4 figures, 3 tables. Accepted at IEEE Global Conference on Signal and Information Processing, 201
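The distance-dependent weighting of within-camera versus cross-camera conditional inputs could look roughly like the sketch below. The inverse-linear weighting rule, the feature shapes, and the max_gap parameter are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the conditioning step described above: encodings of
# within-camera neighbour frames and cross-camera corresponding frames are
# merged by a weighted average before conditioning the cGAN generator.
import torch

def merge_conditions(within_cam_feat, cross_cam_feat, temporal_gap, max_gap=30):
    """within_cam_feat, cross_cam_feat: (B, C, H, W) condition encodings;
    temporal_gap: (B,) frames between the nearest observed frame and the missing one."""
    # weight within-camera evidence more when it is temporally close, and
    # lean on the overlapping cameras as the gap grows (and vice versa)
    w = (1.0 - temporal_gap.clamp(max=max_gap).float() / max_gap).view(-1, 1, 1, 1)
    return w * within_cam_feat + (1.0 - w) * cross_cam_feat

# usage: the merged tensor is the conditional input to the cGAN generator
cond = merge_conditions(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32),
                        torch.tensor([3, 25]))
```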