Semantic Image Networks for Human Action Recognition
In this paper, we propose the use of a semantic image, an improved
representation for video analysis, principally in combination with Inception
networks. The semantic image is obtained by applying localized sparse
segmentation using global clustering (LSSGC) prior to the approximate rank
pooling which summarizes the motion characteristics in single or multiple
images. It incorporates the background information by overlaying a static
background from the window onto the subsequent segmented frames. The idea is to
improve the action-motion dynamics by focusing on the region which is important
for action recognition and encoding the temporal variances using the frame
ranking method. We also propose the sequential combination of
Inception-ResNetv2 and a long short-term memory (LSTM) network to leverage the
temporal variances for improved recognition performance. Extensive analysis has
been carried out on UCF101 and HMDB51 datasets which are widely used in action
recognition studies. We show that (i) the semantic image generates better
activations and converges faster than its original variant, (ii) using
segmentation prior to approximate rank pooling yields better recognition
performance, (iii) the use of LSTM leverages the temporal variance information
from approximate rank pooling to model the action behavior better than the base
network, (iv) the proposed representations can be adaptive as they can be used
with existing methods such as temporal segment networks to improve the
recognition performance, and (v) our proposed four-stream network architecture
comprising semantic images and semantic optical flows achieves
state-of-the-art performance, 95.9% and 73.5% recognition accuracy on UCF101
and HMDB51, respectively.
Comment: 30 pages
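The approximate rank pooling step this abstract relies on can be sketched with the closed-form weighting used in the dynamic-image literature, where frame t of T receives weight 2t - T - 1; the clip shape below is an illustrative assumption, not taken from the paper:

```python
import numpy as np

def approximate_rank_pooling(frames):
    """Collapse a (T, H, W, C) clip into one image by a weighted sum of
    frames; later frames receive larger weights, so the result encodes
    the temporal evolution of appearance in a single image."""
    T = len(frames)
    # Closed-form rank-pooling weights: alpha_t = 2t - T - 1, t = 1..T
    alphas = 2 * np.arange(1, T + 1) - T - 1
    return np.tensordot(alphas, frames.astype(np.float64), axes=1)

# Toy clip: 4 frames of 8x8 single-channel noise
clip = np.random.rand(4, 8, 8, 1)
dyn = approximate_rank_pooling(clip)
print(dyn.shape)  # (8, 8, 1)
```

A useful sanity check on this weighting: the weights sum to zero, so a perfectly static clip pools to an all-zero image, leaving only motion information.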
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents the futuristic challenges discussed in the
cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers from
several conferences and journals, including CVPR, ICCV, ECCV, NIPS, PAMI, and
IJCV.
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Large-scale labeled data are generally required to train deep neural networks
in order to obtain better performance in visual feature learning from images or
videos for computer vision applications. To avoid the extensive cost of
collecting and annotating large-scale datasets, self-supervised learning
methods, a subset of unsupervised learning methods, have been proposed to learn
general image and video features from large-scale unlabeled data without any
human-annotated labels. This paper provides an extensive review of deep
learning-based self-supervised general visual feature learning methods from
images or videos. First, the motivation, general pipeline, and terminologies of
this field are described. Then the common deep neural network architectures
that are used for self-supervised learning are summarized. Next, the main
components and evaluation metrics of self-supervised learning methods are
reviewed followed by the commonly used image and video datasets and the
existing self-supervised visual feature learning methods. Finally, quantitative
performance comparisons of the reviewed methods on benchmark datasets are
summarized and discussed for both image and video feature learning. The paper
concludes with a list of promising future directions for self-supervised
visual feature learning.
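As a concrete illustration of the pretext tasks such surveys cover, here is a minimal sketch of rotation prediction, one common self-supervised objective in which the label comes for free from the data; the array shapes and function name are illustrative assumptions, not taken from the survey:

```python
import numpy as np

def make_rotation_pretext_batch(images, rng):
    """Self-supervised pretext task: rotate each image by a random
    multiple of 90 degrees and use the rotation index as a free label.
    A network trained to predict that label must learn visual features
    without any human annotation."""
    xs, ys = [], []
    for img in images:
        k = rng.integers(0, 4)      # label in {0,1,2,3} -> 0/90/180/270 deg
        xs.append(np.rot90(img, k))
        ys.append(k)
    return np.stack(xs), np.array(ys)

rng = np.random.default_rng(0)
images = rng.random((16, 32, 32, 3))    # toy unlabeled batch
x, y = make_rotation_pretext_batch(images, rng)
print(x.shape, y.shape)  # (16, 32, 32, 3) (16,)
```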
Recurrent Models for Situation Recognition
This work proposes Recurrent Neural Network (RNN) models to predict
structured 'image situations' -- actions and noun entities fulfilling semantic
roles related to the action. In contrast to prior work relying on Conditional
Random Fields (CRFs), we use a specialized action prediction network followed
by an RNN for noun prediction. Our system obtains state-of-the-art accuracy on
the challenging recent imSitu dataset, beating CRF-based models, including ones
trained with additional data. Further, we show that specialized features
learned from situation prediction can be transferred to the task of image
captioning to more accurately describe human-object interactions.
Comment: To appear at ICCV 201
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems
for Video Technology (TCSVT)
Temporal Unet: Sample Level Human Action Recognition using WiFi
Human actions distort WiFi signals, and this distortion has been widely
exploited for action recognition, for example in fall detection for the
elderly, hand sign language recognition, and keystroke estimation. To the best
of our knowledge, past work recognizes human actions by categorizing one
complete distortion series as one action, which we term series-level action
recognition. In this paper, we
introduce a much more fine-grained and challenging action recognition task into
WiFi sensing domain, i.e., sample-level action recognition. In this task, every
WiFi distortion sample in the whole series should be categorized into one
action, which is a critical technique in precise action localization,
continuous action segmentation, and real-time action recognition. To achieve
WiFi-based sample-level action recognition, we fully analyze approaches in
image-based semantic segmentation as well as in video-based frame-level action
recognition, and then propose a simple yet efficient deep convolutional neural
network, Temporal Unet. Experimental results show that Temporal Unet performs
this novel task well. Code has been made publicly available at
https://github.com/geekfeiw/WiSLAR.
Comment: 14 pages, 14 figures, 1 table
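To make the series-level vs. sample-level distinction concrete, here is a minimal numpy sketch; the logits array and class count are invented for illustration, and the paper's actual model is a 1-D convolutional U-Net rather than this post-hoc argmax:

```python
import numpy as np

def series_level(logits):
    """Series-level recognition: average per-sample class scores over
    the whole WiFi distortion series, then assign one action label to
    the entire series."""
    return int(logits.mean(axis=0).argmax())

def sample_level(logits):
    """Sample-level recognition: label every WiFi sample independently,
    which is what enables precise action localization and continuous
    action segmentation."""
    return logits.argmax(axis=1)

# Toy per-sample class scores: 100 samples, 3 candidate actions
rng = np.random.default_rng(1)
logits = rng.random((100, 3))
print(series_level(logits), sample_level(logits).shape)
```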
A Survey on Deep Learning Methods for Robot Vision
Deep learning has allowed a paradigm shift in pattern recognition, from using
hand-crafted features together with statistical classifiers to using
general-purpose learning procedures for learning data-driven representations,
features, and classifiers together. The application of this new paradigm has
been particularly successful in computer vision, in which the development of
deep learning methods for vision applications has become a hot research topic.
Given that deep learning has already attracted the attention of the robot
vision community, the main purpose of this survey is to address the use of deep
learning in robot vision. To achieve this, a comprehensive overview of deep
learning and its usage in computer vision is given, that includes a description
of the most frequently used neural models and their main application areas.
Then, the standard methodology and tools used for designing deep-learning based
vision systems are presented. Afterwards, a review of the principal work using
deep learning in robot vision is presented, as well as current and future
trends related to the use of deep learning in robotics. This survey is intended
to be a guide for the developers of robot vision systems
Generalized Zero-Shot Learning for Action Recognition with Web-Scale Video Data
Action recognition in surveillance video makes our lives safer by detecting
criminal events or predicting violent emergencies. However, efficient action
recognition is not free of difficulty. First, there are so many action classes
in daily life that we cannot pre-define all of them beforehand. Moreover, it is
very hard to collect real-world videos of certain actions, such as stealing
and street fighting, due to legal restrictions and privacy protection. These
challenges make existing data-driven recognition methods insufficient to
attain the desired performance. Zero-shot learning has the potential to solve
these issues, since it can perform classification without positive examples.
Nevertheless, current zero-shot
learning algorithms have been studied under the unreasonable setting where seen
classes are absent during the testing phase. Motivated by this, we study the
task of action recognition in surveillance video under a more realistic
\emph{generalized zero-shot setting}, where testing data contains both seen and
unseen classes. To the best of our knowledge, this is the first work to study
video action recognition under the generalized zero-shot setting. We first
perform extensive empirical studies of several existing zero-shot learning
approaches under this new setting on web-scale video data. Our experimental
results demonstrate that, under the generalized setting, typical zero-shot
learning methods are no longer effective on the dataset we used. We then
propose a
method for action recognition by deploying generalized zero-shot learning,
which transfers the knowledge of web video to detect the anomalous actions in
surveillance videos. To verify the effectiveness of our proposed method, we
further construct a new surveillance video dataset consisting of nine action
classes related to the public safety situation.
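Generalized zero-shot performance, where test data mixes seen and unseen classes, is commonly scored by the harmonic mean of per-class accuracies over the two groups, so a model cannot score well by ignoring unseen classes. A minimal sketch of that metric (the toy labels below are invented, not from the paper):

```python
import numpy as np

def gzsl_harmonic_mean(y_true, y_pred, seen_classes):
    """Per-class accuracy averaged separately over seen and unseen
    classes, then combined by the harmonic mean, the standard
    generalized zero-shot learning score."""
    accs = {"seen": [], "unseen": []}
    for c in np.unique(y_true):
        mask = y_true == c
        acc = (y_pred[mask] == c).mean()
        accs["seen" if c in seen_classes else "unseen"].append(acc)
    s, u = np.mean(accs["seen"]), np.mean(accs["unseen"])
    return 2 * s * u / (s + u) if (s + u) else 0.0

y_true = np.array([0, 0, 1, 1, 2, 2])   # classes 0,1 seen; class 2 unseen
y_pred = np.array([0, 0, 1, 0, 2, 0])
print(round(gzsl_harmonic_mean(y_true, y_pred, {0, 1}), 3))  # 0.6
```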
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers presented
at CVPR2015, the premier annual computer vision event, held in June 2015, in
order to grasp the trends in the field. Further, we propose "DeepSurvey" as a
mechanism embodying the entire process, from reading all of the papers and
generating ideas to writing papers.
Comment: Survey paper
Recent Advances in Zero-shot Recognition
With the recent renaissance of deep convolution neural networks, encouraging
breakthroughs have been achieved on supervised recognition tasks, where each
class has sufficient, fully annotated training data.
However, scaling recognition to a large number of classes with few or no
training samples per class remains an unsolved problem. One approach to
scaling up recognition is to develop models capable of recognizing unseen
categories without any training instances, i.e., zero-shot
recognition/learning. This article provides a comprehensive review of existing
zero-shot recognition techniques, covering various aspects ranging from model
representations to datasets and evaluation settings. We also review related
recognition tasks, including one-shot and open-set recognition, which can be
used as natural extensions of zero-shot recognition when a limited number of
class samples becomes available or when zero-shot recognition is deployed in a
real-world setting. Importantly, we highlight the limitations of existing
approaches and point out future research directions in this new research area.
Comment: Accepted by IEEE Signal Processing Magazine
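As a concrete illustration of the model side such reviews cover, here is a minimal sketch of classic attribute-based zero-shot recognition: project an input feature into attribute space and pick the class with the nearest attribute vector, so unseen classes need only an attribute description, never training images. The projection matrix and attribute table below are toy assumptions:

```python
import numpy as np

def zero_shot_predict(x, attr_matrix, W):
    """Map a feature vector into attribute space via a (pre-learned)
    projection W, then return the index of the class whose attribute
    vector is nearest in Euclidean distance."""
    a = x @ W                                     # predicted attributes
    dists = np.linalg.norm(attr_matrix - a, axis=1)
    return int(dists.argmin())

# Toy setup: 4-D features, 2 attributes, 3 classes described by attributes
W = np.array([[1., 0.], [0., 1.], [0., 0.], [0., 0.]])
attrs = np.array([[1., 0.], [0., 1.], [1., 1.]])  # one row per class
print(zero_shot_predict(np.array([0.9, 0.1, 0.3, 0.2]), attrs, W))  # 0
```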