Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks
Human actions in video sequences are three-dimensional (3D) spatio-temporal
signals characterizing both the visual appearance and motion dynamics of the
involved humans and objects. Inspired by the success of convolutional neural
networks (CNN) for image classification, recent attempts have been made to
learn 3D CNNs for recognizing human actions in videos. However, partly due to
the high complexity of training 3D convolution kernels and the need for large
quantities of training videos, only limited success has been reported. This
has motivated us in this paper to investigate a new deep architecture that can
handle 3D signals more effectively. Specifically, we propose factorized
spatio-temporal convolutional networks (FstCN) that factorize the original 3D
convolution kernel learning as a sequential process of learning 2D spatial
kernels in the lower layers (called spatial convolutional layers), followed by
learning 1D temporal kernels in the upper layers (called temporal convolutional
layers). We introduce a novel transformation and permutation operator to make
factorization in FstCN possible. Moreover, to address the issue of sequence
alignment, we propose an effective training and inference strategy based on
sampling multiple video clips from a given action video sequence. We have
tested FstCN on two commonly used benchmark datasets (UCF-101 and HMDB-51).
Without using auxiliary training videos to boost the performance, FstCN
outperforms existing CNN-based methods and achieves performance comparable to
a recent method that benefits from using auxiliary training videos.
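To make the factorization concrete, here is a minimal sketch in PyTorch: 2D spatial kernels are applied per frame, then 1D temporal kernels run along the time axis at each spatial site. The layer sizes, and the transpose-and-reshape standing in for the paper's transformation and permutation operator, are illustrative assumptions, not the authors' exact design.

```python
# A minimal sketch of factorized spatio-temporal convolution in the spirit of
# FstCN: a 2D spatial convolution applied per frame, followed by a 1D temporal
# convolution across frames. Layer sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class FactorizedSTConv(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, t_kernel=3):
        super().__init__()
        # 2D spatial kernels (lower "spatial convolutional layer")
        self.spatial = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        # 1D temporal kernels (upper "temporal convolutional layer")
        self.temporal = nn.Conv1d(mid_ch, out_ch, kernel_size=t_kernel,
                                  padding=t_kernel // 2)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        # Fold time into the batch so the 2D conv sees each frame independently.
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        _, m, h2, w2 = y.shape
        # Transpose and reshape so the 1D conv runs along the temporal axis at
        # every spatial site; this stands in for the paper's transformation
        # and permutation operator.
        y = y.reshape(b, t, m, h2 * w2).permute(0, 3, 2, 1)  # (b, hw, m, t)
        y = self.temporal(y.reshape(b * h2 * w2, m, t))
        return y.reshape(b, h2, w2, -1, t).permute(0, 3, 4, 1, 2)

clip = torch.randn(2, 3, 8, 32, 32)  # two 8-frame RGB clips
print(FactorizedSTConv(3, 16, 32)(clip).shape)  # torch.Size([2, 32, 8, 32, 32])
```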
Unsupervised Learning from Video with Deep Neural Embeddings
Because of the rich dynamical structure of videos and their ubiquity in
everyday life, it is a natural idea that video data could serve as a powerful
unsupervised learning signal for training visual representations in deep neural
networks. However, instantiating this idea, especially at large scale, has
remained a significant artificial intelligence challenge. Here we present the
Video Instance Embedding (VIE) framework, which extends powerful recent
unsupervised loss functions for learning deep nonlinear embeddings to
multi-stream temporal processing architectures on large-scale video datasets.
We show that VIE-trained networks substantially advance the state of the art in
unsupervised learning from video datastreams, both for action recognition on
the Kinetics dataset and for object recognition on the ImageNet dataset. We show
that a hybrid model with both static and dynamic processing pathways is optimal
for both transfer tasks, and provide analyses indicating how the pathways
differ. Taken in context, our results suggest that deep neural embeddings are a
promising approach to unsupervised visual learning across a wide variety of
domains.
Comment: To appear in CVPR 2020.
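For intuition, the following is a minimal sketch of the non-parametric instance-discrimination objective that frameworks like VIE build on: each clip embedding is pulled toward its own slot in a memory bank of normalized embeddings and pushed away from all others. The encoder, bank size, and temperature are placeholder assumptions, not the paper's settings.

```python
# A minimal sketch of an instance-embedding objective of the kind VIE extends:
# each video clip is matched to its own entry in a memory bank and contrasted
# against all other entries via a temperature-scaled softmax.
import torch
import torch.nn.functional as F

num_videos, dim, tau = 1000, 128, 0.07
memory = F.normalize(torch.randn(num_videos, dim), dim=1)  # one slot per video

def instance_loss(embeddings, indices):
    # embeddings: (batch, dim) clip features from any backbone;
    # indices: the dataset index of each clip's source video.
    z = F.normalize(embeddings, dim=1)
    logits = z @ memory.t() / tau          # similarity to every instance
    return F.cross_entropy(logits, indices)

z = torch.randn(8, dim, requires_grad=True)
idx = torch.randint(0, num_videos, (8,))
loss = instance_loss(z, idx)
loss.backward()
print(float(loss))
```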
Healthcare Event and Activity Logging.
The health of patients in the intensive care unit (ICU) can change frequently and inexplicably. Crucial events and activities responsible for these changes often go unnoticed. This paper introduces healthcare event and activity logging (HEAL), which automatically and unobtrusively monitors and reports on events and activities that occur in a medical ICU room. HEAL uses a multimodal distributed camera network to monitor and identify ICU activities and estimate sanitation-event qualifiers. At its core is a novel approach to infer person roles based on semantic interactions, a critical requirement in many healthcare settings where individuals must not be identified. The proposed approach for activity representation identifies a basis of contextual aspects and estimates aspect weights for proper action representation and reconstruction. The flexibility of the proposed algorithms enables the identification of people's roles by associating them with inferred interactions and detected activities. A fully working prototype system is developed, tested in a mock ICU room, and then deployed in two ICU rooms at a community hospital, thus offering unique capabilities for data gathering and analytics. The proposed method achieves a role-identification accuracy of 84% and a backtracking role-identification accuracy of 79% for obscured roles using interaction and appearance features on real ICU data. Detailed experimental results are provided in the context of four sanitation-event qualifiers: clean, transmission, contamination, and unclean.
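As a loose illustration of the aspect-weight idea, the sketch below reconstructs an action feature as a nonnegative combination of "aspect" basis vectors and keeps the fitted weights as the representation. The random basis and dimensions are assumptions, and HEAL's actual basis learning is not reproduced.

```python
# A hedged sketch of aspect-weight estimation: given a fixed basis of
# contextual aspect vectors, fit nonnegative weights that reconstruct an
# observed action feature. The basis here is random, purely for illustration.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
basis = rng.random((64, 8))        # 8 hypothetical aspect bases, 64-dim features
feature = rng.random(64)           # one observed action feature

weights, residual = nnls(basis, feature)   # nonnegative aspect weights
reconstruction = basis @ weights           # action reconstruction from aspects
print(weights.round(2), residual)
```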
Memory Warps for Learning Long-Term Online Video Representations
This paper proposes a novel memory-based online video representation that is
efficient, accurate and predictive. This is in contrast to prior works that
often rely on computationally heavy 3D convolutions, ignore actual motion when
aligning features over time, or operate in an off-line mode to utilize future
frames. In particular, our memory (i) holds the feature representation, (ii) is
spatially warped over time to compensate for observer and scene motions, (iii)
can carry long-term information, and (iv) enables predicting feature
representations in future frames. By exploring a variant that operates at
multiple temporal scales, we efficiently learn across even longer time
horizons. We apply our online framework to object detection in videos,
obtaining a large 2.3 times speed-up and losing only 0.9% mAP on the ImageNet-VID
dataset, compared to prior works that even use future frames. Finally, we
demonstrate the predictive property of our representation in two novel
detection setups, where features are propagated over time to (i) significantly
enhance a real-time detector by more than 10% mAP in a multi-threaded online
setup and to (ii) anticipate objects in future frames.
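The core warping step can be sketched as follows: a spatial feature memory is resampled along a motion field before being fused with features from the current frame. The random flow and the simple averaging fusion are placeholder assumptions; the paper learns both the motion compensation and the memory update.

```python
# A minimal sketch of warping a feature memory by a motion field, then fusing
# it with current-frame features. The fusion rule here is a naive average.
import torch
import torch.nn.functional as F

def warp(memory, flow):
    # memory: (b, c, h, w); flow: (b, 2, h, w) displacements in pixels (dx, dy).
    b, _, h, w = memory.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()          # (h, w, 2)
    grid = grid + flow.permute(0, 2, 3, 1)                # displace each site
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(memory, grid, align_corners=True)

memory = torch.randn(1, 64, 32, 32)   # long-term feature memory
flow = torch.randn(1, 2, 32, 32)      # stand-in for an estimated motion field
current = torch.randn(1, 64, 32, 32)  # features of the newest frame
updated = 0.5 * warp(memory, flow) + 0.5 * current  # naive fusion
print(updated.shape)
```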
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.Comment: Accepted for publication in IEEE Transactions on Circuits and Systems
for Video Technology (TCSVT
Explainable and Advisable Learning for Self-driving Vehicles
Deep neural perception and control networks are likely to be a key component of self-driving vehicles. These models need to be explainable: they should provide easy-to-interpret rationales for their behavior, so that passengers, insurance companies, law enforcement, developers, etc., can understand what triggered a particular behavior. Explanations may be triggered by the neural controller, namely introspective explanations, or informed by the neural controller's output, namely rationalizations. Our work has focused on the challenge of generating introspective explanations of deep models for self-driving vehicles. In Chapter 3, we begin by exploring the use of visual explanations. These explanations take the form of real-time highlighted regions of an image that causally influence the network's output (steering control). In the first stage, we use a visual attention model to train a convolutional network end-to-end from images to steering angle. The attention model highlights image regions that potentially influence the network's output. Some of these are true influences, but some are spurious. We then apply a causal filtering step to determine which input regions actually influence the output. This produces more succinct visual explanations and more accurately exposes the network's behavior. In Chapter 4, we add an attention-based video-to-text model to produce textual explanations of model actions, e.g., "the car slows down because the road is wet". The attention maps of the controller and the explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment: strong and weak alignment. These explainable systems represent an externalization of tacit knowledge. The network's opaque reasoning is simplified to a situation-specific dependence on a visible object in the image. This makes them brittle and potentially unsafe in situations that do not match the training data. In Chapter 5, we propose to address this issue by augmenting the training data with natural language advice from a human. Advice includes guidance about what to do and where to attend. We present the first step toward advice-giving, where we train an end-to-end vehicle controller that accepts advice. The controller adapts the way it attends to the scene (visual attention) and its control outputs (steering and speed). Further, in Chapter 6, we propose a new approach that learns vehicle control with the help of long-term (global) human advice. Specifically, our system learns to summarize its visual observations in natural language, predict an appropriate action response (e.g., "I see a pedestrian crossing, so I stop"), and predict the controls accordingly.
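A hedged sketch of the visual-attention controller described in Chapter 3: a CNN feature map is weighted by a learned spatial attention map, pooled, and regressed to a steering angle, with the attention map serving as the candidate explanation. The architecture sizes are assumptions, and the causal filtering step is omitted.

```python
# A minimal sketch of an attention-based steering controller whose spatial
# attention map doubles as a visual explanation. Sizes are illustrative.
import torch
import torch.nn as nn

class AttentionController(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, ch, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.attn = nn.Conv2d(ch, 1, 1)       # one logit per spatial location
        self.head = nn.Linear(ch, 1)          # steering-angle regressor

    def forward(self, image):
        f = self.features(image)                        # (b, ch, h, w)
        a = torch.softmax(self.attn(f).flatten(2), -1)  # spatial attention
        a = a.view(f.size(0), 1, *f.shape[2:])
        pooled = (f * a).sum(dim=(2, 3))                # attention-weighted pool
        return self.head(pooled), a                     # control + explanation

steer, attention = AttentionController()(torch.randn(2, 3, 80, 160))
print(steer.shape, attention.shape)
```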
DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks
We propose an action recognition framework using Generative Adversarial
Networks. Our model involves training a deep convolutional generative
adversarial network (DCGAN) using a large video activity dataset without label
information. Then we use the trained discriminator from the GAN model as an
unsupervised pre-training step and fine-tune the trained discriminator model on
a labeled dataset to recognize human activities. We determine good network
architectural and hyperparameter settings for using the discriminator from
DCGAN as a trained model to learn useful representations for action
recognition. Our semi-supervised framework using only appearance information
achieves performance superior or comparable to the current state-of-the-art
semi-supervised action recognition methods on two challenging video activity
datasets: UCF101 and HMDB51.
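The transfer recipe can be sketched as follows: after unsupervised GAN training, the discriminator's convolutional trunk is kept and a new classification head is fine-tuned on labeled data. The DCGAN-style trunk below is a generic stand-in, not the paper's exact architecture.

```python
# A minimal sketch of reusing a GAN discriminator as a pre-trained feature
# extractor for semi-supervised action recognition.
import torch
import torch.nn as nn

trunk = nn.Sequential(                      # DCGAN-style discriminator trunk
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())
gan_head = nn.Linear(128, 1)                # real/fake head used in GAN training

# ... unsupervised GAN training of (trunk, gan_head) would happen here ...

classifier = nn.Sequential(trunk, nn.Linear(128, 101))  # e.g. 101 UCF classes
optim = torch.optim.Adam(classifier.parameters(), lr=1e-4)  # fine-tune all layers

logits = classifier(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 101])
```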
CNN-based Action Recognition and Supervised Domain Adaptation on 3D Body Skeletons via Kernel Feature Maps
Deep learning is ubiquitous across many areas of computer vision. It
often requires large-scale datasets for training before being fine-tuned on
small-to-medium scale problems. Activity recognition, in other words action
recognition, is one of many application areas of deep learning. While there
exist many Convolutional Neural Network architectures that work with RGB
and optical flow frames, training on time sequences of 3D body skeleton
joints is often performed via recurrent networks such as LSTMs.
In this paper, we propose a new representation which encodes sequences of 3D
body skeleton joints in texture-like representations derived from
mathematically rigorous kernel methods. Such a representation becomes the first
layer in a standard CNN network e.g., ResNet-50, which is then used in the
supervised domain adaptation pipeline to transfer information from the source
to target dataset. This lets us leverage the available Kinect-based data beyond
training on a single dataset and outperform simple fine-tuning on any two
datasets combined in a naive manner. More specifically, in this paper we
utilize the overlapping classes between datasets. We associate datapoints of
the same class via the so-called commonality, known from supervised domain
adaptation. We demonstrate state-of-the-art results on three publicly
available benchmarks.
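A rough sketch of the texture-like encoding: each joint-coordinate sequence is compared against a set of pivot values with an RBF kernel, yielding an image-shaped tensor that a standard 2D CNN such as ResNet-50 can consume. The pivots, bandwidth, and layout are assumptions, not the paper's exact kernel feature maps.

```python
# A hedged sketch of mapping a 3D skeleton sequence to a texture-like tensor
# via RBF kernel responses against scalar pivots.
import torch

def kernel_map(skeleton, pivots, sigma=0.5):
    # skeleton: (time, joints, 3) coordinates; pivots: (p,) scalar anchors.
    t, j, d = skeleton.shape
    x = skeleton.reshape(t, j * d)                         # flatten coordinates
    diff = x.unsqueeze(-1) - pivots                        # (t, j*d, p)
    feat = torch.exp(-diff.pow(2) / (2 * sigma ** 2))      # RBF responses
    return feat.permute(2, 1, 0)                           # (p, j*d, t) image

skeleton = torch.randn(100, 25, 3)                         # e.g. Kinect joints
pivots = torch.linspace(-1, 1, steps=16)
image = kernel_map(skeleton, pivots)
print(image.shape)  # torch.Size([16, 75, 100]); treat pivots as channels
```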
Simultaneous Joint and Object Trajectory Templates for Human Activity Recognition from 3-D Data
The availability of low-cost range sensors and the development of relatively
robust algorithms for the extraction of skeleton joint locations have inspired
many researchers to develop human activity recognition methods using the 3-D
data. In this paper, an effective method for the recognition of human
activities from the normalized joint trajectories is proposed. We represent the
actions as multidimensional signals and introduce a novel method for generating
action templates by averaging the samples in a "dynamic time" sense. Then in
order to deal with the variations in the speed and style of performing actions,
we warp the samples to the action templates by an efficient algorithm and
employ wavelet filters to extract meaningful spatiotemporal features. The
proposed method is also capable of modeling the human-object interactions, by
performing the template generation and temporal warping procedure via the joint
and object trajectories simultaneously. The experimental evaluation on several
challenging datasets demonstrates the effectiveness of our method compared to
the state of the art.
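A minimal sketch of the warp-then-extract pipeline: a sample trajectory is aligned to an action template with plain dynamic time warping, and wavelet detail coefficients of the warped signal serve as spatiotemporal features. The random template stands in for the paper's "dynamic time" average, and db2 is an assumed wavelet choice.

```python
# A hedged sketch: DTW-align a trajectory to a template, then extract wavelet
# features from the warped signal.
import numpy as np
import pywt

def dtw_path(a, b):
    # a, b: (time, dims) trajectories; returns the optimal alignment path.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i-1, j], cost[i, j-1], cost[i-1, j-1])
    path, (i, j) = [], (n, m)
    while i > 0 and j > 0:                    # backtrack the cheapest path
        path.append((i - 1, j - 1))
        i, j = min([(i-1, j), (i, j-1), (i-1, j-1)], key=lambda p: cost[p])
    return path[::-1]

rng = np.random.default_rng(0)
template = rng.standard_normal((40, 3))       # stand-in averaged trajectory
sample = rng.standard_normal((55, 3))
warped = np.stack([sample[i] for i, _ in dtw_path(sample, template)])
approx, detail = pywt.dwt(warped, "db2", axis=0)  # wavelet features
print(warped.shape, detail.shape)
```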
Shuffle and Learn: Unsupervised Learning using Temporal Order Verification
In this paper, we present an approach for learning a visual representation
from the raw spatiotemporal signals in videos. Our representation is learned
without supervision from semantic labels. We formulate our method as an
unsupervised sequential verification task, i.e., we determine whether a
sequence of frames from a video is in the correct temporal order. With this
simple task and no semantic labels, we learn a powerful visual representation
using a Convolutional Neural Network (CNN). The representation contains
complementary information to that learned from supervised image datasets like
ImageNet. Qualitative results show that our method captures information that is
temporally varying, such as human pose. When used as pre-training for action
recognition, our method gives significant gains over learning without external
data on benchmark datasets like UCF101 and HMDB51. To demonstrate its
sensitivity to human pose, we show results for pose estimation on the FLIC and
MPII datasets that are competitive with, or better than, approaches using
significantly more supervision. Our method can be combined with supervised
representations to provide an additional boost in accuracy.
Comment: Accepted at ECCV 2016.
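The pretext task can be sketched as follows: frame triplets are either kept in temporal order or shuffled, and a shared CNN plus a binary head must tell the two cases apart. The tiny backbone below is a stand-in for the paper's AlexNet-style network.

```python
# A minimal sketch of the temporal-order-verification pretext task.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())          # per-frame embedding (32-d)
verifier = nn.Linear(3 * 32, 2)                     # in-order vs. shuffled

def order_logits(triplet):
    # triplet: (batch, 3, 3, h, w) -- three RGB frames per sample.
    feats = [backbone(triplet[:, i]) for i in range(3)]
    return verifier(torch.cat(feats, dim=1))

frames = torch.randn(4, 3, 3, 64, 64)
labels = torch.randint(0, 2, (4,))                  # 1 = correct temporal order
loss = nn.functional.cross_entropy(order_logits(frames), labels)
loss.backward()
print(float(loss))
```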