cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents the futuristic challenges discussed in the cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers from several
conferences and journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
A Large-scale Varying-view RGB-D Action Dataset for Arbitrary-view Human Action Recognition
Current research on action recognition mainly focuses on single-view and
multi-view recognition, which can hardly satisfy the requirements of
human-robot interaction (HRI) applications to recognize actions from arbitrary
views. The lack of datasets also sets up barriers. To provide data for
arbitrary-view action recognition, we collect a new large-scale RGB-D action
dataset for arbitrary-view action analysis, including RGB videos, depth and
skeleton sequences. The dataset includes action samples captured in 8 fixed
viewpoints and varying-view sequences which cover the entire 360 degree view
angles. In total, 118 persons are invited to act 40 action categories, and
25,600 video samples are collected. Compared with existing datasets, ours involves more participants,
more viewpoints and a larger number of samples. More importantly, it is the
first dataset containing the entire 360 degree varying-view sequences. The
dataset provides sufficient data for multi-view, cross-view and arbitrary-view
action analysis. Besides, we propose a View-guided Skeleton CNN (VS-CNN) to
tackle the problem of arbitrary-view action recognition. Experiment results
show that the VS-CNN achieves superior performance.
Comment: Original version has been published by ACMMM 201
Multi-Camera Action Dataset for Cross-Camera Action Recognition Benchmarking
Action recognition has received increasing attention from the computer vision
and machine learning communities in the last decade. To enable the study of
this problem, there exist a vast number of action datasets, which are recorded
under controlled laboratory settings, real-world surveillance environments, or
crawled from the Internet. Apart from the "in-the-wild" datasets, the training
and test splits of conventional datasets often share similar environmental
conditions, which leads to close-to-perfect performance on constrained
datasets. In this paper, we introduce a new dataset, namely Multi-Camera Action
Dataset (MCAD), which is designed to evaluate the open view classification
problem under the surveillance environment. In total, MCAD contains 14,298
action samples from 18 action categories, which are performed by 20 subjects
and independently recorded with 5 cameras. Inspired by the well-received
evaluation approach on the LFW dataset, we designed a standard evaluation
protocol and benchmarked MCAD under several scenarios. The benchmark shows that
while an average of 85% accuracy is achieved under the closed-view scenario,
the performance suffers from a significant drop under the cross-view scenario.
In the worst-case scenario, the performance of 10-fold cross-validation drops
from 87.0% to 47.4%.
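To make the closed-view versus cross-view gap concrete, here is a minimal sketch that trains on clips from one camera and tests on another. The data loader, feature dimensionality and classifier are placeholder assumptions, not the official MCAD protocol.

```python
# Minimal sketch of closed-view vs. cross-view evaluation.
# load_features() is a hypothetical placeholder, not an MCAD API.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def load_features(camera_id):
    """Placeholder: return (features, labels) for all clips of one camera."""
    rng = np.random.default_rng(camera_id)
    X = rng.normal(size=(200, 128))        # 200 clips, 128-d clip features
    y = rng.integers(0, 18, size=200)      # 18 action categories
    return X, y

def evaluate(train_cam, test_cam):
    X_tr, y_tr = load_features(train_cam)
    X_te, y_te = load_features(test_cam)
    if train_cam == test_cam:              # closed view: hold out half the clips
        X_tr, y_tr, X_te, y_te = X_tr[::2], y_tr[::2], X_te[1::2], y_te[1::2]
    clf = LinearSVC().fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

print("closed-view:", evaluate(0, 0))      # train and test on the same camera
print("cross-view :", evaluate(0, 1))      # train on camera 0, test on camera 1
```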
Hierarchically Learned View-Invariant Representations for Cross-View Action Recognition
Recognizing human actions from varied views is challenging due to huge
appearance variations in different views. The key to this problem is to learn
discriminant view-invariant representations generalizing well across views. In
this paper, we address this problem by learning view-invariant representations
hierarchically using a novel method, referred to as Joint Sparse Representation
and Distribution Adaptation (JSRDA). To obtain robust and informative feature
representations, we first incorporate a sample-affinity matrix into the
marginalized stacked denoising Autoencoder (mSDA) to obtain shared features,
which are then combined with the private features. In order to make the feature
representations of videos across views transferable, we then learn a
transferable dictionary pair simultaneously from pairs of videos taken at
different views to encourage each action video across views to have the same
sparse representation. However, the distribution difference across views still
exists because a unified subspace where the sparse representations of one
action across views are the same may not exist when the view difference is
large. Therefore, we propose a novel unsupervised distribution adaptation
method that learns a set of projections that project the source and target
view data into respective low-dimensional subspaces where the marginal and
conditional distribution differences are reduced simultaneously. As a result,
the learned feature representation is view-invariant and remains robust to
substantial distribution differences across views, even when the view
difference is large. Experimental results on four multi-view datasets show that
our approach outperforms the state-of-the-art approaches.
Comment: Published in IEEE Transactions on Circuits and Systems for Video
Technology; code can be found at https://yangliu9208.github.io/JSRDA
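As a rough illustration of the distribution-adaptation idea, the sketch below projects features from two views into a shared low-dimensional subspace and measures the remaining marginal discrepancy with a linear maximum mean discrepancy. The PCA projection and synthetic data are illustrative stand-ins, not the JSRDA optimization described in the paper.

```python
# Generic sketch: project source/target view features and compare the
# empirical mean discrepancy before and after projection (synthetic data).
import numpy as np

def linear_mmd(Xs, Xt):
    """Squared distance between the empirical feature means of two views."""
    return float(np.sum((Xs.mean(axis=0) - Xt.mean(axis=0)) ** 2))

rng = np.random.default_rng(0)
Xs = rng.normal(loc=0.0, size=(300, 64))   # source-view features
Xt = rng.normal(loc=0.5, size=(300, 64))   # target-view features (shifted)

# PCA on pooled data as a stand-in for the learned low-dimensional subspace;
# JSRDA's projections additionally reduce conditional distribution differences.
pooled = np.vstack([Xs, Xt])
pooled -= pooled.mean(axis=0)
_, _, Vt = np.linalg.svd(pooled, full_matrices=False)
P = Vt[:10].T                               # 64 x 10 projection matrix

print("marginal discrepancy before projection:", linear_mmd(Xs, Xt))
print("marginal discrepancy after projection :", linear_mmd(Xs @ P, Xt @ P))
```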
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers
presented at CVPR2015, the premier annual computer vision event held in
June 2015, in order to grasp the trends in the field. Further, we propose
"DeepSurvey" as a mechanism embodying the entire process, from reading through
all the papers and generating ideas to writing a paper.
Comment: Survey Paper
Space-Time Representation of People Based on 3D Skeletal Data: A Review
Spatiotemporal human representation based on 3D visual perception data is a
rapidly growing research area. Based on the information sources, these
representations can be broadly categorized into two groups based on RGB-D
information or 3D skeleton data. Recently, skeleton-based human representations
have been intensively studied and keep attracting increasing attention, owing
to their robustness to variations in viewpoint, human body scale and motion
speed, as well as their real-time, online performance. This paper presents a
comprehensive survey of existing space-time representations of people based on
3D skeletal data, and provides an informative categorization and analysis of
these methods from several perspectives, including information modality,
representation encoding, structure and transition, and feature engineering. We
also provide a brief overview of skeleton acquisition devices and construction
methods, enlist a number of public benchmark datasets with skeleton data, and
discuss potential future research directions.
Comment: Our paper has been accepted by the journal Computer Vision and Image
Understanding, see
http://www.sciencedirect.com/science/article/pii/S1077314217300279
Viewpoint Invariant Action Recognition using RGB-D Videos
In video-based action recognition, viewpoint variations often pose major
challenges because the same actions can appear different from different views.
We use the complementary RGB and Depth information from the RGB-D cameras to
address this problem. The proposed technique capitalizes on the spatio-temporal
information available in the two data streams to extract action features
that are largely insensitive to the viewpoint variations. We use the RGB data
to compute dense trajectories that are translated to viewpoint insensitive deep
features under a non-linear knowledge transfer model. Similarly, the Depth
stream is used to extract CNN-based view invariant features on which Fourier
Temporal Pyramid is computed to incorporate the temporal information. The
heterogeneous features from the two streams are combined and used as a
dictionary to predict the label of the test samples. To that end, we propose a
sparse-dense collaborative representation classification scheme that strikes a
balance between the discriminative abilities of the dense and the sparse
representations of the samples over the extracted heterogeneous dictionary.
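For a concrete picture of representation-based classification over such a labeled dictionary, the sketch below shows only the dense, ridge-regularized (collaborative) half of a sparse-dense scheme on synthetic data; the paper's actual feature extraction and sparse-dense balancing are not reproduced here.

```python
# Minimal collaborative-representation classifier over a labeled dictionary
# (synthetic data; only the dense, ridge-regularized coding step is shown).
import numpy as np

def crc_predict(D, labels, y, lam=0.01):
    """D: d x n dictionary, labels: n class ids, y: d-dimensional test vector."""
    n = D.shape[1]
    # Dense coding: alpha = (D^T D + lam I)^{-1} D^T y
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ y)
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        # Class-wise reconstruction residual, normalized by code energy.
        residuals[c] = np.linalg.norm(y - D[:, mask] @ alpha[mask]) / (
            np.linalg.norm(alpha[mask]) + 1e-12)
    return min(residuals, key=residuals.get)   # smallest residual wins

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 40))                  # 40 training samples, 64-d features
labels = np.repeat(np.arange(8), 5)            # 8 classes, 5 samples each
y = D[:, 7] + 0.05 * rng.normal(size=64)       # noisy copy of a class-1 sample
print("predicted class:", crc_predict(D, labels, y))
```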
PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding
Although many 3D human activity benchmarks have been proposed, most
existing action datasets focus on action recognition for pre-segmented
videos. There is a lack of standard large-scale benchmarks,
especially for the currently popular data-hungry deep learning based methods. In this
paper, we introduce a new large-scale benchmark (PKU-MMD) for continuous
multi-modality 3D human action understanding, covering a wide range of complex
human activities with well annotated information. PKU-MMD contains 1076 long
video sequences in 51 action categories, performed by 66 subjects in three
camera views. It contains almost 20,000 action instances and 5.4 million frames
in total. Our dataset also provides multi-modality data sources, including RGB,
depth, infrared radiation and skeleton. With different modalities, we conduct
extensive experiments on our dataset in terms of two scenarios and evaluate
different methods by various metrics, including a newly proposed evaluation
protocol, 2D-AP. We believe this large-scale dataset will benefit future
research on action detection for the community.
Comment: 10 pages
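As background for detection-style evaluation on such continuous, untrimmed sequences, here is a small sketch of temporal-IoU-based matching between predicted and ground-truth action intervals. It is a generic illustration with made-up intervals, not the paper's 2D-AP protocol.

```python
# Generic temporal-IoU matching between predicted and ground-truth action
# intervals (illustrative only; not the PKU-MMD 2D-AP protocol).
def temporal_iou(a, b):
    """a, b: (start, end) intervals in frames."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_at_iou(predictions, ground_truth, thresh=0.5):
    """Fraction of predictions matching an as-yet-unmatched ground-truth interval."""
    used, hits = set(), 0
    for p in predictions:
        for i, g in enumerate(ground_truth):
            if i not in used and temporal_iou(p, g) >= thresh:
                used.add(i)
                hits += 1
                break
    return hits / len(predictions) if predictions else 0.0

gt   = [(100, 250), (400, 520)]            # ground-truth action instances
pred = [(110, 240), (300, 380), (405, 500)]
print("precision@0.5 IoU:", precision_at_iou(pred, gt))
```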
What I See Is What You See: Joint Attention Learning for First and Third Person Video Co-analysis
In recent years, more and more videos are captured from the first-person
viewpoint by wearable cameras. Such first-person video provides additional
information besides the traditional third-person video, and thus has a wide
range of applications. However, techniques for analyzing the first-person video
can be fundamentally different from those for the third-person video, and it is
even more difficult to explore the shared information from both viewpoints. In
this paper, we propose a novel method for first- and third-person video
co-analysis. At the core of our method is the notion of "joint attention",
indicating the learnable representation that corresponds to the shared
attention regions in different viewpoints and thus links the two viewpoints. To
this end, we develop a multi-branch deep network with a triplet loss to extract
the joint attention from the first- and third-person videos via self-supervised
learning. We evaluate our method on a public dataset with cross-viewpoint
video matching tasks. Our method outperforms the state-of-the-art both
qualitatively and quantitatively. We also demonstrate how the learned joint
attention can benefit various applications through a set of additional
experiments.
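Since the method centers on a triplet loss that pulls co-attended first- and third-person clips together in embedding space, here is a minimal numpy sketch of that loss. The embeddings, margin value, and sampling of triplets are assumptions for illustration, not the paper's exact network or training setup.

```python
# Minimal triplet loss on clip embeddings: pull an anchor (e.g., a first-person
# clip) toward a matching third-person clip and away from a non-matching one.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to the match
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to the non-match
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
anchor   = rng.normal(size=128)                # first-person clip embedding
positive = anchor + 0.1 * rng.normal(size=128) # same scene, third-person view
negative = rng.normal(size=128)                # unrelated clip

print("triplet loss:", triplet_loss(anchor, positive, negative))
```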
Recent Advances in Zero-shot Recognition
With the recent renaissance of deep convolutional neural networks, encouraging
breakthroughs have been achieved on supervised recognition tasks, where
each class has sufficient and fully annotated training data. However, scaling
recognition to a large number of classes with few or no
training samples for each class remains an unsolved problem. One approach to
scaling up the recognition is to develop models capable of recognizing unseen
categories without any training instances, or zero-shot recognition/ learning.
This article provides a comprehensive review of existing zero-shot recognition
techniques, covering various aspects ranging from representations of models to
datasets and evaluation settings. We also overview related recognition
tasks, including one-shot and open-set recognition, which can be used as natural
extensions of zero-shot recognition when a limited number of class samples
become available or when zero-shot recognition is implemented in a real-world setting.
Importantly, we highlight the limitations of existing approaches and point out
future research directions in this new research area.
Comment: accepted by IEEE Signal Processing Magazine
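To make the core zero-shot idea concrete, here is a small sketch that maps a visual feature to a semantic (attribute) vector with a linear model learned on seen classes, then assigns the nearest unseen-class attribute prototype. It is a generic embedding-based illustration with synthetic attributes and data, not any specific method from the survey.

```python
# Generic zero-shot classification sketch: learn a visual -> attribute mapping
# on seen classes, classify unseen classes by nearest attribute prototype.
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_att = 64, 16

seen_attrs   = rng.normal(size=(10, d_att))    # attribute vectors, 10 seen classes
unseen_attrs = rng.normal(size=(3, d_att))     # attribute vectors, 3 unseen classes

# Synthetic seen-class training data: visual features generated from attributes.
W_true = rng.normal(size=(d_att, d_vis))
y_seen = rng.integers(0, 10, size=500)
X_seen = seen_attrs[y_seen] @ W_true + 0.1 * rng.normal(size=(500, d_vis))

# Ridge regression from visual features to attribute space (seen classes only).
A_seen = seen_attrs[y_seen]
M = np.linalg.solve(X_seen.T @ X_seen + 1e-2 * np.eye(d_vis), X_seen.T @ A_seen)

# Test sample from unseen class 2: predict attributes, pick nearest prototype
# by cosine similarity.
x_test = unseen_attrs[2] @ W_true + 0.1 * rng.normal(size=d_vis)
a_pred = x_test @ M
scores = unseen_attrs @ a_pred / (
    np.linalg.norm(unseen_attrs, axis=1) * np.linalg.norm(a_pred) + 1e-12)
print("predicted unseen class:", int(np.argmax(scores)))  # expected: 2
```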