10,023 research outputs found
Im2Flow: Motion Hallucination from Static Images for Action Recognition
Existing methods to recognize actions in static images take the images at
their face value, learning the appearances---objects, scenes, and body
poses---that distinguish each action class. However, such models are deprived
of the rich dynamic structure and motions that also define human activity. We
propose an approach that hallucinates the unobserved future motion implied by a
single snapshot to help static-image action recognition. The key idea is to
learn a prior over short-term dynamics from thousands of unlabeled videos,
infer the anticipated optical flow on novel static images, and then train
discriminative models that exploit both streams of information. Our main
contributions are twofold. First, we devise an encoder-decoder convolutional
neural network and a novel optical flow encoding that can translate a static
image into an accurate flow map. Second, we show the power of hallucinated flow
for recognition, successfully transferring the learned motion into a standard
two-stream network for activity recognition. On seven datasets, we demonstrate
the power of the approach. It not only achieves state-of-the-art accuracy for
dense optical flow prediction, but also consistently enhances recognition of
actions and dynamic scenes.
Comment: Published in CVPR 2018, project page:
http://vision.cs.utexas.edu/projects/im2flow
Mining Mid-level Features for Action Recognition Based on Effective Skeleton Representation
Recently, mid-level features have shown promising performance in computer
vision. Mid-level features learned by incorporating class-level information are
potentially more discriminative than traditional low-level local features. In
this paper, an effective method is proposed to extract mid-level features from
Kinect skeletons for 3D human action recognition. Firstly, the orientations of
limbs connected by two skeleton joints are computed and each orientation is
encoded into one of the 27 states indicating the spatial relationship of the
joints. Secondly, limbs are combined into parts and the limb states are
mapped into part states. Finally, frequent pattern mining is employed to mine
the most frequent and relevant (discriminative, representative and
non-redundant) part states over several consecutive frames. These parts are
referred to as Frequent Local Parts or FLPs. The FLPs allow us to build a
powerful bag-of-FLP action representation. This new representation yields
state-of-the-art results on MSR DailyActivity3D and MSR ActionPairs3D.
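The first step of the abstract above, encoding a limb orientation into one of 27 states, can be illustrated with a minimal sketch. The abstract does not say how the 27 states are defined; the version below assumes a ternary quantization (negative, near zero, positive) of the joint-to-joint offset along each of the three axes, which gives 3 x 3 x 3 = 27 states. The function name `limb_state`, the threshold `eps`, and the joint coordinates are hypothetical and chosen only for illustration.

```python
def limb_state(joint_a, joint_b, eps=0.1):
    """Map the limb from joint_a to joint_b to a state index in 0..26.

    Assumption (not from the abstract): each axis offset is quantized into
    three bins, below -eps, within [-eps, eps], or above eps, and the three
    ternary digits are combined into a single base-3 index.
    """
    state = 0
    for a, b in zip(joint_a, joint_b):
        d = b - a
        if d < -eps:
            digit = 0      # joint_b lies clearly on the negative side of joint_a
        elif d > eps:
            digit = 2      # joint_b lies clearly on the positive side
        else:
            digit = 1      # roughly aligned along this axis
        state = state * 3 + digit
    return state


# Hypothetical 3D joint positions in metres (x, y, z).
shoulder = (0.0, 1.4, 2.0)
elbow = (0.3, 1.1, 2.0)
print(limb_state(shoulder, elbow))  # prints 19
```

Part states and frequent pattern mining would operate on sequences of such indices; those stages are not sketched here because the abstract does not describe them in enough detail.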
Learning a Pose Lexicon for Semantic Action Recognition
This paper presents a novel method for learning a pose lexicon comprising
semantic poses defined by textual instructions and their associated visual
poses defined by visual features. The proposed method simultaneously takes two
input streams, semantic poses and visual pose candidates, and statistically
learns a mapping between them to construct the lexicon. With the learned
lexicon, action recognition can be cast as the problem of finding the maximum
translation probability of a sequence of semantic poses given a stream of
visual pose candidates. Experiments evaluating pre-trained and zero-shot action
recognition conducted on MSRC-12 gesture and WorkoutSu-10 exercise datasets
were used to verify the efficacy of the proposed method.
Comment: Accepted by the 2016 IEEE International Conference on Multimedia and
Expo (ICME 2016). 6 pages paper and 4 pages supplementary material
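The recognition step described in the abstract above, choosing the action whose semantic pose sequence has the highest translation probability given the visual pose stream, can be sketched as follows. This is not the authors' model: it assumes an IBM-Model-1-style score with uniform alignments and a uniform prior over actions, so that ranking by P(visual | semantic) stands in for the posterior. The lexicon table `t`, the pose names, and the action definitions are hypothetical toy values.

```python
import math


def log_translation_prob(semantic_seq, visual_stream, t, floor=1e-9):
    """Log P(visual_stream | semantic_seq) under a uniform-alignment model.

    t maps (visual_pose, semantic_pose) to a lexicon probability; pairs that
    are missing from the table fall back to a small floor value.
    """
    logp = 0.0
    for v in visual_stream:
        # Marginalize over which semantic pose generated this visual pose.
        p = sum(t.get((v, s), floor) for s in semantic_seq) / len(semantic_seq)
        logp += math.log(p)
    return logp


def recognize(actions, visual_stream, t):
    """Return the action name whose semantic poses best explain the stream."""
    return max(actions, key=lambda name: log_translation_prob(actions[name], visual_stream, t))


# Hypothetical learned lexicon: P(visual pose | semantic pose).
t = {
    ("v_up", "arms_raised"): 0.9, ("v_down", "arms_raised"): 0.1,
    ("v_up", "arms_lowered"): 0.1, ("v_down", "arms_lowered"): 0.9,
    ("v_squat", "knees_bent"): 0.8, ("v_up", "knees_bent"): 0.1,
    ("v_down", "knees_bent"): 0.1,
}
# Hypothetical actions defined as sequences of semantic poses.
actions = {
    "raise_arms": ["arms_lowered", "arms_raised"],
    "squat": ["arms_lowered", "knees_bent"],
}
print(recognize(actions, ["v_down", "v_up", "v_up"], t))  # prints raise_arms
```

Because an action is defined purely by its semantic pose sequence, a new action can be scored without visual training examples of that action, which is one way to read the zero-shot setting mentioned in the abstract. Note that the uniform-alignment assumption ignores pose order; an order-aware alignment model would be needed to tell apart actions that share the same poses in a different sequence.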
Do less and achieve more: Training CNNs for action recognition utilizing action images from the Web
Recently, attempts have been made to collect millions of videos to train Convolutional Neural Network (CNN) models for action recognition in videos. However, curating such large-scale video datasets requires immense human labor, and training CNNs on millions of videos demands huge computational resources. In contrast, collecting action images from the Web is much easier and training on images requires much less computation. In addition, labeled web images tend to contain discriminative action poses, which highlight discriminative portions of a video’s temporal progression. Through extensive experiments, we explore the question of whether we can utilize web action images to train better CNN models for action recognition in videos. We collect 23.8K manually filtered images from the Web that depict the 101 actions in the UCF101 action video dataset. We show that by utilizing web action images along with videos in training, significant performance boosts of CNN models can be achieved. We also investigate the scalability of the process by leveraging crawled web images (unfiltered) for UCF101 and ActivityNet. Using unfiltered images, we can achieve performance improvements that are on par with using filtered images. This means we can further reduce annotation labor and easily scale up to larger problems. We also shed light on an artifact of finetuning CNN models that reduces the effective parameters of the CNN and show that using web action images can significantly alleviate this problem.
https://arxiv.org/pdf/1512.07155v1.pdf
First author draft