Localized Trajectories for 2D and 3D Action Recognition
The Dense Trajectories concept is one of the most successful approaches in action recognition, suitable for scenarios involving a significant amount of motion. However, due to noise and background motion, many of the generated trajectories are irrelevant to the actual human activity and can degrade performance. In this paper, we propose Localized Trajectories, an improved version of Dense Trajectories in which motion trajectories are clustered around human body joints provided by RGB-D cameras and then encoded by local Bag-of-Words. As a result, Localized Trajectories provide a more discriminative representation of actions. Moreover, we generalize Localized Trajectories to 3D using the depth modality. One of the main advantages of 3D Localized Trajectories is that they describe radial displacements that are perpendicular to the image plane. Extensive experiments and analysis were carried out on five different datasets.
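No code accompanies this abstract; the numpy sketch below illustrates the local Bag-of-Words encoding it describes. The nearest-joint assignment rule, the array shapes, and the precomputed codebook are assumptions for illustration, not the authors' exact pipeline.

    import numpy as np

    def localized_bow(traj_descriptors, traj_positions, joint_positions, codebook):
        """Encode trajectories with one Bag-of-Words histogram per body joint.

        traj_descriptors: (N, D) trajectory descriptors (e.g., HOG/HOF/MBH)
        traj_positions:   (N, 2) mean image position of each trajectory
        joint_positions:  (J, 2) body-joint locations from an RGB-D skeleton
        codebook:         (K, D) visual words learned offline (e.g., by k-means)
        Returns a (J, K) matrix: one L1-normalized histogram per joint.
        """
        J, K = len(joint_positions), len(codebook)
        # 1) Cluster trajectories around joints: assign each to its nearest joint.
        d_joint = np.linalg.norm(traj_positions[:, None] - joint_positions[None], axis=2)
        nearest_joint = d_joint.argmin(axis=1)                      # (N,)
        # 2) Quantize each descriptor to its nearest visual word.
        d_word = np.linalg.norm(traj_descriptors[:, None] - codebook[None], axis=2)
        nearest_word = d_word.argmin(axis=1)                        # (N,)
        # 3) Accumulate a local histogram per joint.
        hist = np.zeros((J, K))
        np.add.at(hist, (nearest_joint, nearest_word), 1.0)
        hist /= np.maximum(hist.sum(axis=1, keepdims=True), 1e-8)   # L1 normalize
        return hist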
Body Joint guided 3D Deep Convolutional Descriptors for Action Recognition
Three-dimensional convolutional neural networks (3D CNNs) have been established as a powerful tool for learning features from the spatial and temporal dimensions simultaneously, which makes them well suited to video-based action recognition. In this work, we propose not to use the activations of the fully-connected layers of a 3D CNN directly as the video feature, but to form a discriminative video descriptor from selected convolutional-layer activations, pooled under the guidance of body joint positions. Two schemes for mapping body joints onto convolutional feature maps for pooling are discussed. The body joint positions can be obtained from any off-the-shelf skeleton estimation algorithm, and we systematically evaluate how helpful the body-joint-guided feature pooling remains when the skeleton estimation is inaccurate. To make the approach end-to-end and independent of any sophisticated body joint detection algorithm, we further propose a two-stream bilinear model that learns the guidance from the body joints and captures the spatio-temporal features simultaneously. In this model, the body-joint-guided feature pooling is conveniently formulated as a bilinear product operation. Experimental results on three real-world datasets demonstrate the effectiveness of body-joint-guided pooling, which achieves promising performance.
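The abstract's bilinear-product formulation can be made concrete with a small PyTorch sketch. The guidance maps here are assumed to be Gaussians centred on joint locations mapped into the feature-map grid; the tensor sizes are illustrative, not the paper's configuration.

    import torch

    def joint_guided_bilinear_pool(feat, guide):
        """Bilinear pooling of 3D-CNN features under body-joint guidance.

        feat:  (C, T, H, W) activations of a selected convolutional layer
        guide: (P, T, H, W) guidance maps, e.g. Gaussians centred on body
               joints mapped into the feature-map resolution (one per part)
        Returns a (P, C) descriptor: guidance-weighted pooling written as
        a bilinear product, echoing the paper's two-stream formulation.
        """
        C = feat.shape[0]
        P = guide.shape[0]
        f = feat.reshape(C, -1)          # (C, T*H*W)
        g = guide.reshape(P, -1)         # (P, T*H*W)
        return g @ f.t()                 # (P, C) bilinear product

    # toy example: 64 channels, 4 frames, 7x7 maps, 15 joints
    desc = joint_guided_bilinear_pool(torch.randn(64, 4, 7, 7),
                                      torch.rand(15, 4, 7, 7))
    print(desc.shape)  # torch.Size([15, 64])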
Drive Video Analysis for the Detection of Traffic Near-Miss Incidents
Because of their recent introduction, self-driving cars and vehicles equipped with advanced driver assistance systems (ADAS) have had little opportunity to learn the dangerous traffic scenarios (including near-miss incidents) that give normal drivers strong motivation to drive safely. Accordingly, as a means of providing such learning depth, this paper presents a novel traffic database that contains a large number of traffic near-miss incidents, obtained by mounting driving recorders in more than 100 taxis over the course of a decade. The study makes two main contributions: (i) to assist automated systems in detecting near-miss incidents from database instances, we created a large-scale traffic near-miss incident database (NIDB) consisting of video clips of dangerous events captured by monocular driving recorders; (ii) to illustrate the applicability of NIDB traffic near-miss incidents, we provide two primary database-related improvements: parameter fine-tuning using various near-miss scenes from NIDB, and the incorporation of foreground/background separation into the motion representation. Then, using our new database in conjunction with a monocular driving recorder, we developed a near-miss recognition method whose performance is comparable to a human-level understanding of near-miss incidents (64.5% vs. 68.4% for near-miss recognition, 61.3% vs. 78.7% for near-miss detection).
Comment: Accepted to ICRA 2018.
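The abstract does not specify how the foreground/background separation is performed. As a naive stand-in only, the OpenCV sketch below uses MOG2 background subtraction to build a crude foreground-motion summary; the input file name is hypothetical, and ego-motion in drive video violates MOG2's static-background assumption, so this is purely illustrative.

    import cv2
    import numpy as np

    # Crude foreground/background separation for a driving-recorder clip.
    cap = cv2.VideoCapture("nearmiss_clip.mp4")        # hypothetical file name
    bg_sub = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=32)
    masks = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        masks.append(bg_sub.apply(frame))              # per-pixel foreground mask
    cap.release()
    motion_map = np.mean(masks, axis=0) if masks else None  # crude motion summary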
Efficient Action Detection in Untrimmed Videos via Multi-Task Learning
This paper studies the joint learning of action recognition and temporal localization in long, untrimmed videos. We employ a multi-task learning framework that performs the three highly related steps of action proposal, action recognition, and action localization refinement in parallel, instead of the standard pipeline that performs them sequentially. We develop a novel temporal actionness regression module that estimates what proportion of a clip contains action. We use it for temporal localization, but it could also serve other applications such as video retrieval, surveillance, and summarization. We also introduce random shear augmentation during training to simulate viewpoint change. We evaluate our framework on three popular video benchmarks. The results demonstrate that our joint model is efficient in terms of storage and computation, since we do not need to compute and cache dense trajectory features, and that it is several times faster than its sequential ConvNets counterpart. Yet, despite being more efficient, it outperforms state-of-the-art methods in accuracy.
Comment: WACV 2017 camera ready; minor updates about test-time efficiency.
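A minimal sketch of the actionness-regression idea follows: a small head that maps a clip feature to a proportion in [0, 1]. The feature dimension, layer sizes, and the smooth-L1 training loss are assumptions; the paper's module may differ.

    import torch
    import torch.nn as nn

    class ActionnessRegressor(nn.Module):
        """Regress what fraction of a clip is covered by an action (in [0, 1])."""
        def __init__(self, feat_dim=4096):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(feat_dim, 256), nn.ReLU(),
                nn.Linear(256, 1), nn.Sigmoid())   # proportion in [0, 1]

        def forward(self, clip_feat):
            return self.head(clip_feat).squeeze(-1)

    model = ActionnessRegressor()
    feats = torch.randn(8, 4096)           # 8 clips, hypothetical feature size
    target = torch.rand(8)                 # ground-truth action proportions
    loss = nn.functional.smooth_l1_loss(model(feats), target)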
Motion Guided 3D Pose Estimation from Videos
We propose a new loss function, called motion loss, for the problem of monocular 3D human pose estimation from 2D pose. In computing the motion loss, a simple yet effective representation of keypoint motion, called pairwise motion encoding, is introduced. We design a new graph convolutional network architecture, U-shaped GCN (UGCN), which captures both short-term and long-term motion information to fully leverage the additional supervision from the motion loss. We train UGCN with the motion loss on two large-scale benchmarks: Human3.6M and MPI-INF-3DHP. Our model surpasses other state-of-the-art models by a large margin, and it also demonstrates strong capacity in producing smooth 3D sequences and recovering keypoint motion.
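The abstract names but does not define the pairwise motion encoding. The sketch below shows one plausible reading: encode motion as temporal differences of pairwise keypoint offsets, and penalize the mismatch between predicted and ground-truth encodings. This interpretation and the shapes are assumptions, not the paper's exact definition.

    import torch

    def pairwise_motion_encoding(pose_seq):
        """pose_seq: (T, K, 3) sequence of 3D poses with K keypoints.

        Returns (T-1, K, K, 3): how the offset between every keypoint pair
        changes from frame t to t+1.
        """
        offsets = pose_seq[:, :, None, :] - pose_seq[:, None, :, :]  # (T, K, K, 3)
        return offsets[1:] - offsets[:-1]

    def motion_loss(pred_seq, gt_seq):
        """Penalize mismatch between predicted and ground-truth keypoint motion."""
        return (pairwise_motion_encoding(pred_seq)
                - pairwise_motion_encoding(gt_seq)).abs().mean()

    loss = motion_loss(torch.randn(16, 17, 3), torch.randn(16, 17, 3))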
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
We propose TAL-Net, an improved approach to temporal action localization in
video that is inspired by the Faster R-CNN object detection framework. TAL-Net
addresses three key shortcomings of existing approaches: (1) we improve
receptive field alignment using a multi-scale architecture that can accommodate
extreme variation in action durations; (2) we better exploit the temporal
context of actions for both proposal generation and action classification by
appropriately extending receptive fields; and (3) we explicitly consider
multi-stream feature fusion and demonstrate that fusing motion late is
important. We achieve state-of-the-art performance for both action proposal and localization on the THUMOS'14 detection benchmark, and competitive performance on the ActivityNet challenge.
Comment: Accepted to CVPR 2018.
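The receptive-field alignment idea in shortcoming (1) can be sketched as one 1D convolution per anchor scale, with dilation growing with the anchor span so that long anchors are scored from proportionally wide temporal context. The channel count, scales, and two-way (action/background) output are assumptions for illustration.

    import torch
    import torch.nn as nn

    class MultiScaleAnchorHead(nn.Module):
        """Align receptive fields with anchor durations via per-scale dilation."""
        def __init__(self, channels=512, anchor_scales=(1, 2, 4, 8, 16)):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Conv1d(channels, 2, kernel_size=3, dilation=s, padding=s)
                for s in anchor_scales)

        def forward(self, x):                       # x: (B, C, T) 1D feature map
            # (B, num_scales, 2, T): action/background score per scale and step
            return torch.stack([b(x) for b in self.branches], dim=1)

    scores = MultiScaleAnchorHead()(torch.randn(2, 512, 128))
    print(scores.shape)  # torch.Size([2, 5, 2, 128])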
Semantic Image Networks for Human Action Recognition
In this paper, we propose the use of a semantic image, an improved
representation for video analysis, principally in combination with Inception
networks. The semantic image is obtained by applying localized sparse
segmentation using global clustering (LSSGC) prior to the approximate rank
pooling which summarizes the motion characteristics in single or multiple
images. It incorporates the background information by overlaying a static
background from the window onto the subsequent segmented frames. The idea is to
improve the action-motion dynamics by focusing on the region which is important
for action recognition and encoding the temporal variances using the frame
ranking method. We also propose the sequential combination of Inception-ResNet-v2 and a long short-term memory (LSTM) network to leverage the temporal variances for improved recognition performance. Extensive analysis has been carried out on the UCF101 and HMDB51 datasets, which are widely used in action recognition studies. We show that (i) the semantic image generates better activations and converges faster than its original variant; (ii) using segmentation prior to approximate rank pooling yields better recognition performance; (iii) the use of an LSTM leverages the temporal variance information from approximate rank pooling to model the action behavior better than the base network; (iv) the proposed representations are adaptive, as they can be used with existing methods such as temporal segment networks to improve recognition performance; and (v) our proposed four-stream network architecture, comprising semantic images and semantic optical flows, achieves state-of-the-art performance: 95.9% and 73.5% recognition accuracy on UCF101 and HMDB51, respectively.
Comment: 30 pages.
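Approximate rank pooling has a known closed form (the dynamic-image coefficients alpha_t = 2t - T - 1 of Bilen et al.), which the paper applies after LSSGC segmentation to obtain the semantic image. A minimal numpy sketch, with the 8-bit rescaling added only for display:

    import numpy as np

    def approximate_rank_pooling(frames):
        """Summarize a clip into one image via approximate rank pooling.

        frames: (T, H, W, C) array of (segmented) frames.
        Uses the closed-form coefficients alpha_t = 2t - T - 1.
        """
        T = len(frames)
        alpha = 2.0 * np.arange(1, T + 1) - T - 1          # (T,)
        img = np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))
        # rescale to a displayable 8-bit image
        img = (img - img.min()) / max(np.ptp(img), 1e-8)
        return (255 * img).astype(np.uint8)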
ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos
Detecting human-object interactions (HOIs) is an important step toward comprehensive visual understanding by machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, it is unlikely even for humans to guess temporal-related HOIs (e.g., opening/closing a door) from a single video frame; the neighboring frames play an essential role. However, conventional HOI methods that operate only on static images have been used to predict temporal-related interactions, which amounts to guessing without temporal context and may lead to sub-optimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI), which utilizes temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features. We construct a new video HOI benchmark dubbed VidHOI, on which our proposed approach serves as a solid baseline.
Comment: Accepted at ACM ICMR'21 Workshop on Intelligent Cross-Data Analysis and Retrieval. The dataset and source code are available at https://github.com/coldmanck/VidHOI
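The "correctly-localized visual features" can be sketched as per-frame RoIAlign along a tracked box rather than one clip-level box. The torchvision sketch below assumes feature and track shapes; it illustrates the trajectory-guided pooling idea, not the authors' exact implementation.

    import torch
    from torchvision.ops import roi_align

    def trajectory_features(feat_maps, tracks):
        """Pool features along a human/object trajectory.

        feat_maps: (T, C, H, W) per-frame backbone features
        tracks:    (T, 4) one float box per frame (x1, y1, x2, y2),
                   in feature-map coordinates
        Pools each frame's features inside that frame's box, then
        averages over time.
        """
        T = feat_maps.shape[0]
        idx = torch.arange(T, dtype=torch.float32).unsqueeze(1)    # frame index
        rois = torch.cat([idx, tracks], dim=1)                     # (T, 5)
        pooled = roi_align(feat_maps, rois, output_size=(7, 7))    # (T, C, 7, 7)
        return pooled.mean(dim=0)                                  # (C, 7, 7)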
A Behavioral Approach to Visual Navigation with Graph Localization Networks
Inspired by research in psychology, we introduce a behavioral approach for
visual navigation using topological maps. Our goal is to enable a robot to
navigate from one location to another, relying only on its visual input and the
topological map of the environment. We propose using graph neural networks for
localizing the agent in the map, and decompose the action space into primitive
behaviors implemented as convolutional or recurrent neural networks. Using the
Gibson simulator, we verify that our approach outperforms relevant baselines and is able to navigate in both seen and unseen environments.
Comment: Video: https://youtu.be/nN3B1F90CF
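A rough sketch of using a graph neural network to localize the agent in the topological map follows: one round of mean-aggregation message passing over map nodes, then matching node embeddings against the current observation. The layer sizes, the single-round aggregation, and the dot-product matching rule are all assumptions.

    import torch
    import torch.nn as nn

    class GraphLocalizer(nn.Module):
        """Score each topological-map node as the agent's location (a sketch)."""
        def __init__(self, node_dim=128, obs_dim=128):
            super().__init__()
            self.msg = nn.Linear(node_dim, node_dim)
            self.obs_enc = nn.Linear(obs_dim, node_dim)

        def forward(self, node_feats, adj, obs):
            # node_feats: (N, node_dim), adj: (N, N) 0/1 adjacency, obs: (obs_dim,)
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
            h = torch.relu(self.msg((adj @ node_feats) / deg) + node_feats)
            logits = h @ self.obs_enc(obs)          # (N,) similarity per node
            return logits.softmax(dim=0)            # localization distribution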
Online Object and Task Learning via Human Robot Interaction
This work describes the development of a robotic system that acquires knowledge incrementally through human interaction, with new tools and motions taught on the fly. The system was one of the five finalists in the KUKA Innovation Award competition and was demonstrated at Hannover Messe 2018 in Germany. Its main contributions are: a) a novel incremental object learning module - a deep-learning-based localization and recognition system - that allows a human to teach new objects to the robot; b) an intuitive user interface for specifying the 3D motion task associated with the new object; and c) a hybrid force-vision control module for performing compliant motion on an unstructured surface. This paper describes the implementation and integration of the system's main modules and summarizes the lessons learned from the competition.
Comment: 7 pages. ICRA1
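The hybrid force-vision idea in contribution c) can be illustrated with a single control step: track the visual trajectory in the surface's tangent plane while regulating contact force along the normal. The decomposition and gains below are assumptions for illustration, not the authors' controller.

    import numpy as np

    def hybrid_force_vision_step(x, x_vis, f_meas, n, f_des=5.0, kp=0.8, kf=0.002):
        """One step of a hybrid force/vision controller (illustrative only).

        x:      current tool position (3,)
        x_vis:  next waypoint from the vision module (3,)
        f_meas: measured contact force along the surface normal (N)
        n:      surface normal (3,)
        """
        n = n / np.linalg.norm(n)
        tangent = np.eye(3) - np.outer(n, n)          # projector onto tangent plane
        v_track = kp * tangent @ (x_vis - x)          # position servo in the plane
        v_force = kf * (f_des - f_meas) * n           # force servo along the normal
        return x + v_track + v_force                  # next commanded position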