Visual Affordance and Function Understanding: A Survey
Robots now play a major role in the manufacturing, entertainment and
healthcare industries. Robot vision aims to equip robots with the ability to
discover information, understand it and interact with the environment. These
capabilities require an agent to effectively understand object affordances and
functionalities in complex visual domains. In this literature survey, we first
focus on visual affordances and summarize the state of the art as well as open
problems and research gaps. Specifically, we discuss sub-problems such as
affordance detection, categorization, segmentation and high-level reasoning.
Furthermore, we cover functional scene understanding and the prevalent
functional descriptors used in the literature. The survey also provides
necessary background to the problem, sheds light on its significance and
highlights the existing challenges for affordance and functionality learning.
Human Action Recognition and Prediction: A Survey
Driven by rapid advances in computer vision and machine learning, video
analysis tasks have been moving from inferring the present state to predicting
the future state. Vision-based action recognition and prediction from videos
are such tasks, where action recognition is to infer human actions (present
state) based upon complete action executions, and action prediction is to
predict human actions (future state) based upon incomplete action executions.
These two
tasks have become particularly prevalent topics recently because of their
explosively emerging real-world applications such as visual surveillance,
autonomous vehicles, entertainment, and video retrieval. Much effort has been
devoted over the last few decades to building a robust
and effective framework for action recognition and prediction. In this paper,
we comprehensively survey state-of-the-art techniques in action recognition
and prediction. Existing models, popular algorithms, technical difficulties,
popular action databases, evaluation protocols, and promising future directions
are also reviewed with systematic discussions.
Space-Time Representation of People Based on 3D Skeletal Data: A Review
Spatiotemporal human representation based on 3D visual perception data is a
rapidly growing research area. These representations can be broadly
categorized into two groups according to their information source: RGB-D data
or 3D skeleton data. Recently, skeleton-based human representations
have been intensively studied and continue to attract increasing attention,
due to their robustness to variations in viewpoint, human body scale and motion
speed, as well as their real-time, online performance. This paper presents a
comprehensive survey of existing space-time representations of people based on
3D skeletal data, and provides an informative categorization and analysis of
these methods from several perspectives, including information modality,
representation encoding, structure and transition, and feature engineering. We
also provide a brief overview of skeleton acquisition devices and construction
methods, enlist a number of public benchmark datasets with skeleton data, and
discuss potential future research directions.
Skeleton Focused Human Activity Recognition in RGB Video
The data-driven approach that learns an optimal representation of vision
features like skeleton frames or RGB videos is currently a dominant paradigm
for activity recognition. While great improvements have been achieved from
existing single modal approaches with increasingly larger datasets, the fusion
of various data modalities at the feature level has seldom been attempted. In
this paper, we propose a multimodal feature fusion model that utilizes both
skeleton and RGB modalities to infer human activity. The objective is to
improve activity recognition accuracy by effectively exploiting the mutually
complementary information among different data modalities. For the skeleton
modality, we propose to use a graph convolutional subnetwork to learn the
skeleton representation. For the RGB modality, we use the spatio-temporal
region of interest from RGB videos and take attention features from the
skeleton modality to guide the learning process. The model
can be trained either individually or jointly by back-propagation in an
end-to-end manner. Experiments on the NTU-RGB+D and Northwestern-UCLA Multiview
datasets achieve state-of-the-art performance,
which indicates that the proposed skeleton-driven attention mechanism for the
RGB modality increases the mutual communication between different data
modalities and brings more discriminative features for inferring human
activities.
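
To make the skeleton-driven attention idea above concrete, the following is a
minimal PyTorch-style sketch in which pooled skeleton features produce spatial
attention weights that re-weight an RGB feature map before fusion. The module
name, feature dimensions and the 7x7 grid are illustrative assumptions, not the
authors' implementation.

import torch
import torch.nn as nn

class SkeletonGuidedFusion(nn.Module):
    """Sketch: fuse a skeleton stream (pooled GCN features) with an RGB
    stream by letting the skeleton features attend over the spatial
    locations of the RGB feature map. Dimensions are placeholders."""

    def __init__(self, skel_dim=256, rgb_dim=512, num_classes=60):
        super().__init__()
        # Map pooled skeleton features to one attention weight per RGB
        # spatial location (here an assumed 7x7 grid, i.e. 49 weights).
        self.attn = nn.Sequential(nn.Linear(skel_dim, 49), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(skel_dim + rgb_dim, num_classes)

    def forward(self, skel_feat, rgb_feat):
        # skel_feat: (B, skel_dim) pooled output of a GCN subnetwork
        # rgb_feat:  (B, rgb_dim, 7, 7) CNN features of the RGB region of interest
        b, c, h, w = rgb_feat.shape
        weights = self.attn(skel_feat).view(b, 1, h, w)    # skeleton-driven attention
        rgb_pooled = (rgb_feat * weights).sum(dim=(2, 3))  # attention-weighted pooling
        fused = torch.cat([skel_feat, rgb_pooled], dim=1)
        return self.classifier(fused)

Trained end-to-end, the attention head and the classifier receive gradients
from the same activity loss, which is one way the two modalities could
"communicate" as the abstract suggests.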
Crowd Behavior Analysis: A Review where Physics meets Biology
Although the traits that emerge in a mass gathering are often non-deliberative,
acts of mass impulse may lead to irrevocable crowd disasters. The two-fold
increase in crowd-related casualties over the past two decades has spurred
significant advances in the field of computer vision towards effective and
proactive crowd surveillance. Computer vision studies related to crowds are
observed to
resonate with the understanding of the emergent behavior in physics (complex
systems) and biology (animal swarm). These studies, which are inspired by
biology and physics, share surprisingly common insights, and interesting
contradictions. However, this aspect of discussion has not been fully explored.
Therefore, this survey provides the readers with a review of the
state-of-the-art methods in crowd behavior analysis from the physics and
biologically inspired perspectives. We provide insights and comprehensive
discussions for a broader understanding of the underlying prospect of blending
physics and biology studies in computer vision.
A Deep Structured Model with Radius-Margin Bound for 3D Human Activity Recognition
Understanding human activity is very challenging even with the recently
developed 3D/depth sensors. To solve this problem, this work investigates a
novel deep structured model, which adaptively decomposes an activity instance
into temporal parts using convolutional neural networks (CNNs). Our model
advances the traditional deep learning approaches in two aspects. First, we
incorporate latent temporal structure into the deep model, accounting for large
temporal variations of diverse human activities. In particular, we utilize the
latent variables to decompose the input activity into a number of temporally
segmented sub-activities, and accordingly feed them into the parts (i.e.
sub-networks) of the deep architecture. Second, we incorporate a radius-margin
bound as a regularization term into our deep model, which effectively improves
the generalization performance for classification. For model training, we
propose a principled learning algorithm that iteratively (i) discovers the
optimal latent variables (i.e. the ways of activity decomposition) for all
training instances, (ii) updates the classifiers based on the generated
features, and (iii) updates the parameters of multi-layer neural networks. In
the experiments, our approach is validated on several complex scenarios for
human activity recognition and demonstrates superior performance over other
state-of-the-art approaches.
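
For background on the radius-margin term: the classical SVM radius-margin
bound scales with R^2 * ||w||^2, where R is the radius of the smallest ball
enclosing the feature vectors and w the classifier weights. Below is a minimal
sketch of one common relaxation that approximates R by the largest distance to
the feature mean; it is illustrative only, not the paper's exact formulation.

import torch

def radius_margin_penalty(features, weight):
    """Approximate radius-margin regularizer: penalize R^2 * ||w||^2,
    estimating the enclosing-ball radius R from the batch of deep
    features (an assumed relaxation; the paper's may differ)."""
    center = features.mean(dim=0, keepdim=True)              # ball-centre estimate
    radius_sq = ((features - center) ** 2).sum(dim=1).max()  # squared radius
    return radius_sq * (weight ** 2).sum()                   # R^2 * ||w||^2

# Usage sketch (names assumed):
# loss = cross_entropy(logits, labels) + lam * radius_margin_penalty(feats, W)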
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
This paper presents the futuristic challenges discussed in the
cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers in
several conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Person Identification with Visual Summary for a Safe Access to a Smart Home
SafeAccess is an integrated system designed to provide easier and safer
access to a smart home for people with or without disabilities. The system is
designed to enhance safety and promote the independence of people with
disability (i.e., the visually impaired). The key functionality of the system
includes detecting and identifying humans and generating a contextual visual
summary from the real-time video streams obtained from cameras placed in
strategic locations around the house. In addition, the system classifies people
into groups (i.e., friends/families/caregivers versus
intruders/burglars/unknown). These features allow the user to grant or deny
remote access to the premises or to call emergency services. In this paper, we
focus on designing a prototype system for the smart home and building a robust
recognition engine that meets the system criteria and addresses speed,
accuracy, deployment and environmental challenges under a wide variety of
practical and real-life situations. To interact with the system, we implemented
a dialog-enabled interface for creating a personalized profile from face images
or videos of friends/families/caregivers. To improve computational efficiency, we
apply change detection to filter out frames and use Faster-RCNN to detect the
human presence and extract faces using Multitask Cascaded Convolutional
Networks (MTCNN). Subsequently, we apply LBP/FaceNet to identify a person and
their group by matching the extracted faces against the stored profiles.
SafeAccess then sends users a visual summary via MMS containing the person's
name if a match is found (or "Unknown" otherwise), a scene image, a facial
description, and contextual information.
SafeAccess identifies friends/families/caregivers versus intruders/unknowns
with an average F-score of 0.97 and generates a visual summary from 10 classes
with an average accuracy of 98.01%.
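
The recognition engine described above is essentially a detection-to-matching
pipeline. The sketch below mirrors it with change detection followed by person
detection, face extraction, embedding and profile matching; detect_persons,
extract_faces, embed_face and MATCH_THRESHOLD are hypothetical stand-ins for
the Faster-RCNN, MTCNN and LBP/FaceNet stages, not the SafeAccess code.

import cv2
import numpy as np

MATCH_THRESHOLD = 0.8  # assumed similarity cutoff, not from the paper

def frame_changed(prev_gray, gray, min_ratio=0.01):
    """Cheap change detection: process a frame only if enough pixels
    differ from the previous one, mirroring the filtering step above."""
    if prev_gray is None:
        return True
    diff = cv2.absdiff(prev_gray, gray)
    return float((diff > 25).mean()) > min_ratio

def identify(frame, profiles, detect_persons, extract_faces, embed_face):
    """One-frame pipeline: person detection, face extraction, face
    embedding, then nearest-profile matching. The three callables are
    hypothetical stand-ins for the models named in the abstract;
    `profiles` maps names to stored face embeddings."""
    names = []
    for box in detect_persons(frame):            # Faster-RCNN stage
        for face in extract_faces(frame, box):   # MTCNN stage
            emb = embed_face(face)               # LBP/FaceNet stage
            sims = {n: float(np.dot(emb, p)) for n, p in profiles.items()}
            best = max(sims, key=sims.get)
            names.append(best if sims[best] > MATCH_THRESHOLD else "Unknown")
    return names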
Understanding hand-object manipulation by modeling the contextual relationship between actions, grasp types and object attributes
This paper proposes a novel method for understanding daily hand-object
manipulation by developing computer vision-based techniques. Specifically, we
focus on recognizing hand grasp types, object attributes and manipulation
actions within a unified framework by exploring their contextual
relationships. Our hypothesis is that it is necessary to jointly model hands,
objects and actions in order to accurately recognize multiple tasks that are
correlated to each other in hand-object manipulation. In the proposed model, we
explore various semantic relationships between actions, grasp types and object
attributes, and show how the context can be used to boost the recognition of
each component. We also explore the spatial relationship between the hand and
object in order to detect the manipulated object from the hand in cluttered
environments. Experimental results on all three recognition tasks show that our
proposed method outperforms traditional appearance-based methods which are not
designed to take into account contextual relationships involved in hand-object
manipulation. The visualization and generalizability study of the learned
context further supports our hypothesis.
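
To illustrate the joint contextual reasoning described above, here is a
minimal sketch that exhaustively scores every (action, grasp type, object
attribute) triple using unary scores plus pairwise compatibility terms and
returns the best triple. The score tables, their keys and shapes are
assumptions for illustration, not the paper's model.

import itertools
import numpy as np

def joint_predict(unary, pairwise):
    """Exhaustive joint inference over small label sets.
    unary: dict of 1-D score arrays keyed 'action', 'grasp', 'attr'.
    pairwise: dict of 2-D compatibility matrices keyed by label-set
    pairs, e.g. ('action', 'grasp'). All inputs are assumed."""
    A, G, T = (len(unary[k]) for k in ('action', 'grasp', 'attr'))
    best, best_score = None, -np.inf
    for a, g, t in itertools.product(range(A), range(G), range(T)):
        score = (unary['action'][a] + unary['grasp'][g] + unary['attr'][t]
                 + pairwise[('action', 'grasp')][a, g]
                 + pairwise[('action', 'attr')][a, t]
                 + pairwise[('grasp', 'attr')][g, t])
        if score > best_score:
            best, best_score = (a, g, t), score
    return best, best_score

Because the label sets involved are small (a few dozen grasps, attributes and
actions), exhaustive search over all triples remains tractable, so no
approximate inference is needed in this toy version.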
On Encoding Temporal Evolution for Real-time Action Prediction
Anticipating future actions is a key component of intelligence, especially as
it applies to real-time systems such as robots or autonomous cars. While
recent works have addressed prediction of raw RGB pixel values, we focus on
anticipating the motion evolution in future video frames. To this end, we
construct dynamic images (DIs) by summarising moving pixels through a sequence
of future frames. We train convolutional LSTMs to predict the next DIs via an
unsupervised learning process, and then recognise the activity associated with
the predicted DI. We demonstrate the effectiveness of our approach on three
benchmark action datasets showing that despite running on videos with complex
activities, our approach is able to anticipate the next human action with high
accuracy and obtain better results than the state-of-the-art methods.
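
For context on the dynamic-image construction, the sketch below uses the
common approximate-rank-pooling weights alpha_t = 2t - T - 1 from the
dynamic-image literature to collapse a clip into a single image; the paper's
exact summarisation of moving pixels may differ.

import numpy as np

def dynamic_image(frames):
    """Collapse a (T, H, W, C) clip into one 'dynamic image' via a
    weighted sum whose coefficients grow linearly with time (an
    assumed standard approximation, not necessarily the paper's)."""
    T = frames.shape[0]
    t = np.arange(1, T + 1, dtype=np.float32)
    alphas = 2.0 * t - T - 1.0               # linear temporal weights
    di = np.tensordot(alphas, frames.astype(np.float32), axes=(0, 0))
    di -= di.min()                           # rescale to [0, 255] for display or a CNN
    if di.max() > 0:
        di /= di.max()
    return (di * 255).astype(np.uint8)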